Linux Device Drivers, 3rd Edition

Jonathan Corbet

Alessandro Rubini

Greg Kroah-Hartman

Published by O'Reilly Media

Beijing ⋅ Cambridge ⋅ Farnham ⋅ Köln ⋅ Sebastopol ⋅ Tokyo

A Note Regarding Supplemental Files

Supplemental files and examples for this book can be found at http://examples.oreilly.com/9780596005900/. Please use a standard desktop web browser to access these files, as they may not be accessible from all ereader devices.

All code files or examples referenced in the book will be available online. For physical books that ship with an accompanying disc, whenever possible, we’ve posted all CD/DVD content. Note that while we provide as much of the media content as we are able via free download, we are sometimes limited by licensing restrictions. Please direct any questions or concerns to .

Preface

This is, on the surface, a book about writing device drivers for the Linux system. That is a worthy goal, of course; the flow of new hardware products is not likely to slow down anytime soon, and somebody is going to have to make all those new gadgets work with Linux. But this book is also about how the Linux kernel works and how to adapt its workings to your needs or interests. Linux is an open system; with this book, we hope, it is more open and accessible to a larger community of developers.

This is the third edition of Linux Device Drivers. The kernel has changed greatly since this book was first published, and we have tried to evolve the text to match. This edition covers the 2.6.10 kernel as completely as we are able. We have, this time around, elected to omit the discussion of backward compatibility with previous kernel versions. The changes from 2.4 are simply too large, and the 2.4 interface remains well documented in the (freely available) second edition.

This edition contains quite a bit of new material relevant to the 2.6 kernel. The discussion of locking and concurrency has been expanded and moved into its own chapter. The Linux device model, which is new in 2.6, is covered in detail. There are new chapters on the USB bus and the serial driver subsystem; the chapter on PCI has also been enhanced. While the organization of the rest of the book resembles that of the earlier editions, every chapter has been thoroughly updated.

We hope you enjoy reading this book as much as we have enjoyed writing it.

Jon's Introduction

The publication of this edition coincides with my twelfth year of working with Linux and, shockingly, my twenty-fifth year in the computing field. Computing seemed like a fast-moving field back in 1980, but things have sped up a lot since then. Keeping Linux Device Drivers up to date is increasingly a challenge; the Linux kernel hackers continue to improve their code, and they have little patience for documentation that fails to keep up.

Linux continues to succeed in the market and, more importantly, in the hearts and minds of developers worldwide. The success of Linux is clearly a testament to its technical quality and to the numerous benefits of free software in general. But the true key to its success, in my opinion, lies in the fact that it has brought the fun back to computing. With Linux, anybody can get their hands into the system and play in a sandbox where contributions from any direction are welcome, but where technical excellence is valued above all else. Linux not only provides us with a top-quality operating system; it gives us the opportunity to be part of its future development and to have fun while we're at it.

In my 25 years in the field, I have had many interesting opportunities, from programming the first Cray computers (in Fortran, on punch cards) to seeing the minicomputer and Unix workstation waves, through to the current, microprocessor-dominated era. Never, though, have I seen the field more full of life, opportunity, and fun. Never have we had such control over our own tools and their evolution. Linux, and free software in general, is clearly the driving force behind those changes.

My hope is that this edition helps to bring that fun and opportunity to a new set of Linux developers. Whether your interests are in the kernel or in user space, I hope you find this book to be a useful and interesting guide to just how the kernel works with the hardware. I hope it helps and inspires you to fire up your editor and to make our shared, free operating system even better. Linux has come a long way, but it is also just beginning; it will be more than interesting to watch—and participate in—what happens from here.

Alessandro's Introduction

I've always enjoyed computers because they can talk to external hardware. So, after soldering my devices for the Apple II and the ZX Spectrum, backed with the Unix and free software expertise the university gave me, I could escape the DOS trap by installing GNU/Linux on a fresh new 386 and by turning on the soldering iron once again.

Back then, the community was a small one, and there wasn't much documentation about writing drivers around, so I started writing for Linux Journal. That's how things started: when I later discovered I didn't like writing papers, I left the university and found myself with an O'Reilly contract in my hands.

That was in 1996. Ages ago.

The computing world is different now: free software looks like a viable solution, both technically and politically, but there's a lot of work to do in both realms. I hope this book furthers two aims: spreading technical knowledge and raising awareness about the need to spread knowledge. That's why, after the first edition proved interesting to the public, the two authors of the second edition switched to a free license, supported by our editor and our publisher. I'm betting this is the right approach to information, and it's great to team up with other people sharing this vision.

I'm excited by what I witness in the embedded arena, and I hope this text helps by doing more; but ideas are moving fast these days, and it's already time to plan for the fourth edition, and look for a fourth author to help.

Greg's Introduction

It seems like a long time ago that I picked up the first edition of this Linux Device Drivers book in order to figure out how to write a real Linux driver. That first edition was a great guide to helping me understand the internals of this operating system that I had already been using for a number of years but whose kernel I had never taken the time to look into. With the knowledge gained from that book, and by reading other programmers' code already present in the kernel, my first horribly buggy, broken, and very SMP-unsafe driver was accepted by the kernel community into the main kernel tree. Despite receiving my first bug report five minutes later, I was hooked on wanting to do as much as I could to make this operating system the best it could possibly be.

I am honored that I've had the ability to contribute to this book. I hope that it enables others to learn the details about the kernel, discover that driver development is not a scary or forbidding place, and possibly encourage others to join in and help in the collective effort of making this operating system work on every computing platform with every type of device available. The development procedure is fun, the community is rewarding, and everyone benefits from the effort involved.

Now it's back to making this edition obsolete by fixing current bugs, changing APIs to work better and be simpler to understand for everyone, and adding new features. Come along; we can always use the help.

Audience for This Book

This book should be an interesting source of information both for people who want to experiment with their computer and for technical programmers who face the need to deal with the inner levels of a Linux box. Note that "a Linux box" is a wider concept than "a PC running Linux," as many platforms are supported by our operating system, and kernel programming is by no means bound to a specific platform. We hope this book is useful as a starting point for people who want to become kernel hackers but don't know where to start.

On the technical side, this text should offer a hands-on approach to understanding the kernel internals and some of the design choices made by the Linux developers. Although the main, official target of the book is teaching how to write device drivers, the material should give an interesting overview of the kernel implementation as well.

Although real hackers can find all the necessary information in the official kernel sources, usually a written text can be helpful in developing programming skills. The text you are approaching is the result of hours of patient grepping through the kernel sources, and we hope the final result is worth the effort it took.

The Linux enthusiast should find in this book enough food for her mind to start playing with the code base and should be able to join the group of developers that is continuously working on new capabilities and performance enhancements. This book does not cover the Linux kernel in its entirety, of course, but Linux device driver authors need to know how to work with many of the kernel's subsystems. Therefore, it makes a good introduction to kernel programming in general. Linux is still a work in progress, and there's always a place for new programmers to jump into the game.

If, on the other hand, you are just trying to write a device driver for your own device, and you don't want to muck with the kernel internals, the text should be modularized enough to fit your needs as well. If you don't want to go deep into the details, you can just skip the most technical sections, and stick to the standard API used by device drivers to seamlessly integrate with the rest of the kernel.

Organization of the Material

The book introduces its topics in ascending order of complexity and is divided into two parts. The first part (Chapters 1-11) begins with the proper setup of kernel modules and goes on to describe the various aspects of programming that you'll need in order to write a full-featured driver for a char-oriented device. Every chapter covers a distinct problem and includes a quick summary at the end, which can be used as a reference during actual development.

Throughout the first part of the book, the organization of the material moves roughly from the software-oriented concepts to the hardware-related ones. This organization is meant to allow you to test the software on your own computer as far as possible without the need to plug external hardware into the machine. Every chapter includes source code and points to sample drivers that you can run on any Linux computer. In Chapter 9 and Chapter 10, however, we ask you to connect an inch of wire to the parallel port in order to test out hardware handling, but this requirement should be manageable by everyone.

The second half of the book (Chapters 12-18) describes block drivers and network interfaces and goes deeper into more advanced topics, such as working with the virtual memory subsystem and with the PCI and USB buses. Many driver authors do not need all of this material, but we encourage you to go on reading anyway. Much of the material found there is interesting as a view into how the Linux kernel works, even if you do not need it for a specific project.

Background Information

In order to be able to use this book, you need to be confident with C programming. Some Unix expertise is needed as well, as we often refer to Unix semantics about system calls, commands, and pipelines.

At the hardware level, no previous expertise is required to understand the material in this book, as long as the general concepts are clear in advance. The text isn't based on specific PC hardware, and we provide all the needed information when we do refer to specific hardware.

Several free software tools are needed to build the kernel, and you often need specific versions of these tools. Those that are too old can lack needed features, while those that are too new can occasionally generate broken kernels. Usually, the tools provided with any current distribution work just fine. Tool version requirements vary from one kernel to the next; consult Documentation/Changes in the source tree of the kernel you are using for exact requirements.
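
A quick way to see what your own system provides is to query the tools directly and compare the results against that requirements file. The particular commands below are illustrative examples, not an exhaustive or authoritative list:

```shell
# Print the versions of a few common build tools, then the running
# kernel. Compare these against the minimums listed in
# Documentation/Changes of the kernel tree you are building.
for tool in gcc make ld; do
    command -v "$tool" >/dev/null && "$tool" --version | head -n 1
done
uname -r
```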

Online Version and License

The authors have chosen to make this book freely available under the Creative Commons "Attribution-ShareAlike" license, Version 2.0:

http://www.oreilly.com/catalog/linuxdrive3

Conventions Used in This Book

The following is a list of the typographical conventions used in this book:

Italic

Used for file and directory names, program and command names, command-line options, URLs, and new terms

Constant Width

Used in examples to show the contents of files or the output from commands, and in the text to indicate words that appear in C code or other literal strings

Constant Width Italic

Used to indicate text within commands that the user replaces with an actual value

Constant Width Bold

Used in examples to show commands or other text that should be typed literally by the user

Pay special attention to notes set apart from the text with the following icons:

Tip

This is a tip. It contains useful supplementary information about the topic at hand.

Warning

This is a warning. It helps you solve and avoid annoying problems.

Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. The code samples are covered by a dual BSD/GPL license.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: "Linux Device Drivers, Third Edition, by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. Copyright 2005 O'Reilly Media, Inc., 0-596-00590-3."

We'd Like to Hear from You

Please address comments and questions concerning this book to the publisher:

O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
(800) 998-9938 (in the United States or Canada)
(707) 829-0515 (international or local)
(707) 829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at:

http://www.oreilly.com/catalog/linuxdrive3

To comment or ask technical questions about this book, send email to:

For more information about our books, conferences, Resource Centers, and the O'Reilly Network, see our web site at:

http://www.oreilly.com

Safari Enabled

When you see a Safari® Enabled icon on the cover of your favorite technology book, that means the book is available online through the O'Reilly Network Safari Bookshelf.

Safari offers a solution that's better than e-books. It's a virtual library that lets you easily search thousands of top tech books, cut and paste code samples, download chapters, and find quick answers when you need the most accurate, current information. Try it for free at http://safari.oreilly.com.

Acknowledgments

This book, of course, was not written in a vacuum; we would like to thank the many people who have helped to make it possible.

Thanks to our editor, Andy Oram; this book is a vastly better product as a result of his efforts. And obviously we owe a lot to the smart people who have laid the philosophical and practical foundations of the current free software renaissance.

The first edition was technically reviewed by Alan Cox, Greg Hankins, Hans Lermen, Heiko Eissfeldt, and Miguel de Icaza (in alphabetic order by first name). The technical reviewers for the second edition were Allan B. Cruse, Christian Morgner, Jake Edge, Jeff Garzik, Jens Axboe, Jerry Cooperstein, Jerome Peter Lynch, Michael Kerrisk, Paul Kinzelman, and Raph Levien. Reviewers for the third edition were Allan B. Cruse, Christian Morgner, James Bottomley, Jerry Cooperstein, Patrick Mochel, Paul Kinzelman, and Robert Love. Together, these people have put a vast amount of effort into finding problems and pointing out possible improvements to our writing.

Last but certainly not least, we thank the Linux developers for their relentless work. This includes both the kernel programmers and the user-space people, who often get forgotten. In this book, we chose never to call them by name in order to avoid being unfair to someone we might forget. We sometimes made an exception to this rule and called Linus by name; we hope he doesn't mind.

Jon

I must begin by thanking my wife Laura and my children Michele and Giulia for filling my life with joy and patiently putting up with my distraction while working on this edition. The subscribers of LWN.net have, through their generosity, enabled much of this work to happen. The Linux kernel developers have done me a great service by letting me be a part of their community, answering my questions, and setting me straight when I got confused. Thanks are due to readers of the second edition of this book whose comments, offered at Linux gatherings over much of the world, have been gratifying and inspiring. And I would especially like to thank Alessandro Rubini for starting this whole exercise with the first edition (and staying with it through the current edition); and Greg Kroah-Hartman, who has brought his considerable skills to bear on several chapters, with great results.

Alessandro

I would like to thank the people that made this work possible. First of all, the incredible patience of Federica, who went as far as letting me review the first edition during our honeymoon, with a laptop in the tent. I want to thank Giorgio and Giulia, who have been involved in later editions of the book and happily accepted to be sons of "a gnu" who often works late in the night. I owe a lot to all the free-software authors who actually taught me how to program by making their work available for anyone to study. But for this edition, I'm mostly grateful to Jon and Greg, who have been great mates in this work; it couldn't have existed without each and both of them, as the code base is bigger and tougher, while my time is a scarcer resource, always contended for by clients, free software issues, and expired deadlines. Jon has been a great leader for this edition; both have been very productive and technically invaluable in supplementing my small-scale and embedded view toward programming with their expertise about SMP and number crunchers.

Greg

I would like to thank my wife Shannon and my children Madeline and Griffin for their understanding and patience while I took the time to work on this book. If it were not for their support of my original Linux development efforts, I would not be able to do this book at all. Thanks also to Alessandro and Jon for offering to let me work on this book; I am honored that they let me participate in it. Much gratitude is given to all of the Linux kernel programmers, who were unselfish enough to write code in the public view, so that I and others could learn so much from just reading it. Also, for everyone who has ever sent me bug reports, critiqued my code, and flamed me for doing stupid things, you have all taught me so much about how to be a better programmer and, throughout it all, made me feel very welcome to be part of this community. Thank you.

Chapter 1. An Introduction to Device Drivers

One of the many advantages of free operating systems, as typified by Linux, is that their internals are open for all to view. The operating system, once a dark and mysterious area whose code was restricted to a small number of programmers, can now be readily examined, understood, and modified by anybody with the requisite skills. Linux has helped to democratize operating systems. The Linux kernel remains a large and complex body of code, however, and would-be kernel hackers need an entry point where they can approach the code without being overwhelmed by complexity. Often, device drivers provide that gateway.

Device drivers take on a special role in the Linux kernel. They are distinct "black boxes" that make a particular piece of hardware respond to a well-defined internal programming interface; they hide completely the details of how the device works. User activities are performed by means of a set of standardized calls that are independent of the specific driver; mapping those calls to device-specific operations that act on real hardware is then the role of the device driver. This programming interface is such that drivers can be built separately from the rest of the kernel and "plugged in" at runtime when needed. This modularity makes Linux drivers easy to write, to the point that there are now hundreds of them available.

There are a number of reasons to be interested in the writing of Linux device drivers. The rate at which new hardware becomes available (and obsolete!) alone guarantees that driver writers will be busy for the foreseeable future. Individuals may need to know about drivers in order to gain access to a particular device that is of interest to them. Hardware vendors, by making a Linux driver available for their products, can add the large and growing Linux user base to their potential markets. And the open source nature of the Linux system means that if the driver writer wishes, the source to a driver can be quickly disseminated to millions of users.

This book teaches you how to write your own drivers and how to hack around in related parts of the kernel. We have taken a device-independent approach; the programming techniques and interfaces are presented, whenever possible, without being tied to any specific device. Each driver is different; as a driver writer, you need to understand your specific device well. But most of the principles and basic techniques are the same for all drivers. This book cannot teach you about your device, but it gives you a handle on the background you need to make your device work.

As you learn to write drivers, you find out a lot about the Linux kernel in general; this may help you understand how your machine works and why things aren't always as fast as you expect or don't do quite what you want. We introduce new ideas gradually, starting off with very simple drivers and building on them; every new concept is accompanied by sample code that doesn't need special hardware to be tested.

This chapter doesn't actually get into writing code. However, we introduce some background concepts about the Linux kernel that you'll be glad you know later, when we do launch into programming.

The Role of the Device Driver

As a programmer, you are able to make your own choices about your driver, and choose an acceptable trade-off between the programming time required and the flexibility of the result. Though it may appear strange to say that a driver is "flexible," we like this word because it emphasizes that the role of a device driver is providing mechanism, not policy.

The distinction between mechanism and policy is one of the best ideas behind the Unix design. Most programming problems can indeed be split into two parts: "what capabilities are to be provided" (the mechanism) and "how those capabilities can be used" (the policy). If the two issues are addressed by different parts of the program, or even by different programs altogether, the software package is much easier to develop and to adapt to particular needs.

For example, Unix management of the graphic display is split between the X server, which knows the hardware and offers a unified interface to user programs, and the window and session managers, which implement a particular policy without knowing anything about the hardware. People can use the same window manager on different hardware, and different users can run different configurations on the same workstation. Even completely different desktop environments, such as KDE and GNOME, can coexist on the same system. Another example is the layered structure of TCP/IP networking: the operating system offers the socket abstraction, which implements no policy regarding the data to be transferred, while different servers are in charge of the services (and their associated policies). Moreover, a server like ftpd provides the file transfer mechanism, while users can use whatever client they prefer; both command-line and graphic clients exist, and anyone can write a new user interface to transfer files.

Where drivers are concerned, the same separation of mechanism and policy applies. The floppy driver is policy free—its role is only to show the diskette as a continuous array of data blocks. Higher levels of the system provide policies, such as who may access the floppy drive, whether the drive is accessed directly or via a filesystem, and whether users may mount filesystems on the drive. Since different environments usually need to use hardware in different ways, it's important to be as policy free as possible.

When writing drivers, a programmer should pay particular attention to this fundamental concept: write kernel code to access the hardware, but don't force particular policies on the user, since different users have different needs. The driver should deal with making the hardware available, leaving all the issues about how to use the hardware to the applications. A driver, then, is flexible if it offers access to the hardware capabilities without adding constraints. Sometimes, however, some policy decisions must be made. For example, a digital I/O driver may only offer byte-wide access to the hardware in order to avoid the extra code needed to handle individual bits.
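To make the byte-wide example concrete, such a driver's read method might simply report the port's current state and leave all bit-level interpretation to the application. The register pointer and all names below are hypothetical; this is a sketch of the idea, not a complete driver:

```c
#include <linux/fs.h>
#include <linux/io.h>
#include <asm/uaccess.h>

/* Hypothetical memory-mapped data register, mapped during driver setup. */
extern void __iomem *dio_data_reg;

/*
 * Byte-wide read: hand the hardware's current byte to user space.
 * The driver imposes no policy on what the individual bits mean.
 */
static ssize_t dio_read(struct file *filp, char __user *buf,
                        size_t count, loff_t *f_pos)
{
    u8 value;

    if (count == 0)
        return 0;
    value = ioread8(dio_data_reg);
    if (copy_to_user(buf, &value, 1))
        return -EFAULT;
    return 1; /* one byte transferred */
}
```

The policy decision here is only the access granularity; everything else (who reads, how often, what the bits mean) is left to user space.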

You can also look at your driver from a different perspective: it is a software layer that lies between the applications and the actual device. This privileged role of the driver allows the driver programmer to choose exactly how the device should appear: different drivers can offer different capabilities, even for the same device. The actual driver design should be a balance between many different considerations. For instance, a single device may be used concurrently by different programs, and the driver programmer has complete freedom to determine how to handle concurrency. You could implement memory mapping on the device independently of its hardware capabilities, or you could provide a user library to help application programmers implement new policies on top of the available primitives, and so forth. One major consideration is the trade-off between the desire to present the user with as many options as possible and the time you have to write the driver, as well as the need to keep things simple so that errors don't creep in.

Policy-free drivers have a number of typical characteristics. These include support for both synchronous and asynchronous operation, the ability to be opened multiple times, the ability to exploit the full capabilities of the hardware, and the lack of software layers to "simplify things" or provide policy-related operations. Drivers of this sort not only work better for their end users, but also turn out to be easier to write and maintain as well. Being policy-free is actually a common target for software designers.

Many device drivers, indeed, are released together with user programs to help with configuration and access to the target device. Those programs can range from simple utilities to complete graphical applications. Examples include the tunelp program, which adjusts how the parallel port printer driver operates, and the graphical cardctl utility that is part of the PCMCIA driver package. Often a client library is provided as well, which provides capabilities that do not need to be implemented as part of the driver itself.

The scope of this book is the kernel, so we try not to deal with policy issues or with application programs or support libraries. Sometimes we talk about different policies and how to support them, but we won't go into much detail about programs using the device or the policies they enforce. You should understand, however, that user programs are an integral part of a software package and that even policy-free packages are distributed with configuration files that apply a default behavior to the underlying mechanisms.

Splitting the Kernel

In a Unix system, several concurrent processes attend to different tasks. Each process asks for system resources, be it computing power, memory, network connectivity, or some other resource. The kernel is the big chunk of executable code in charge of handling all such requests. Although the distinction between the different kernel tasks isn't always clearly marked, the kernel's role can be split (as shown in Figure 1-1) into the following parts:

Process management

The kernel is in charge of creating and destroying processes and handling their connection to the outside world (input and output). Communication among different processes (through signals, pipes, or interprocess communication primitives) is basic to the overall system functionality and is also handled by the kernel. In addition, the scheduler, which controls how processes share the CPU, is part of process management. More generally, the kernel's process management activity implements the abstraction of several processes on top of a single CPU or a few of them.

Memory management

The computer's memory is a major resource, and the policy used to deal with it is a critical one for system performance. The kernel builds up a virtual addressing space for any and all processes on top of the limited available resources. The different parts of the kernel interact with the memory-management subsystem through a set of function calls, ranging from the simple malloc/free pair to much more complex functionalities.

Filesystems

Unix is heavily based on the filesystem concept; almost everything in Unix can be treated as a file. The kernel builds a structured filesystem on top of unstructured hardware, and the resulting file abstraction is heavily used throughout the whole system. In addition, Linux supports multiple filesystem types, that is, different ways of organizing data on the physical medium. For example, disks may be formatted with the Linux-standard ext3 filesystem, the commonly used FAT filesystem, or several others.

Device control

Almost every system operation eventually maps to a physical device. With the exception of the processor, memory, and a very few other entities, any and all device control operations are performed by code that is specific to the device being addressed. That code is called a device driver. The kernel must have embedded in it a device driver for every peripheral present on a system, from the hard drive to the keyboard and the tape drive. This aspect of the kernel's functions is our primary interest in this book.

Networking

Networking must be managed by the operating system, because most network operations are not specific to a process: incoming packets are asynchronous events. The packets must be collected, identified, and dispatched before a process takes care of them. The system is in charge of delivering data packets across program and network interfaces, and it must control the execution of programs according to their network activity. Additionally, all the routing and address resolution issues are implemented within the kernel.

Figure 1-1. A split view of the kernel

Loadable Modules

One of the good features of Linux is the ability to extend at runtime the set of features offered by the kernel. This means that you can add functionality to the kernel (and remove functionality as well) while the system is up and running.

Each piece of code that can be added to the kernel at runtime is called a module . The Linux kernel offers support for quite a few different types (or classes) of modules, including, but not limited to, device drivers. Each module is made up of object code (not linked into a complete executable) that can be dynamically linked to the running kernel by the insmod program and can be unlinked by the rmmod program.
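The canonical first module is a "hello world" that does nothing but log its own loading and unloading; a minimal version for the 2.6 kernel looks like this:

```c
#include <linux/init.h>
#include <linux/module.h>

MODULE_LICENSE("GPL");

/* Called when the module is linked into the running kernel. */
static int __init hello_init(void)
{
    printk(KERN_ALERT "Hello, world\n");
    return 0;
}

/* Called when the module is unlinked again. */
static void __exit hello_exit(void)
{
    printk(KERN_ALERT "Goodbye, cruel world\n");
}

module_init(hello_init);
module_exit(hello_exit);
```

Once built against your kernel tree, the module is loaded with *insmod ./hello.ko* and removed with *rmmod hello*; the messages appear in the system log.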

Figure 1-1 identifies different classes of modules in charge of specific tasks—a module is said to belong to a specific class according to the functionality it offers. The placement of modules in Figure 1-1 covers the most important classes, but is far from complete because more and more functionality in Linux is being modularized.

Classes of Devices and Modules

The Linux way of looking at devices distinguishes between three fundamental device types. Each module usually implements one of these types, and thus is classifiable as a char module, a block module, or a network module. This division of modules into different types, or classes, is not a rigid one; the programmer can choose to build huge modules implementing different drivers in a single chunk of code. Good programmers, nonetheless, usually create a different module for each new functionality they implement, because decomposition is a key element of scalability and extendability.

The three classes are:

Character devices

A character (char) device is one that can be accessed as a stream of bytes (like a file); a char driver is in charge of implementing this behavior. Such a driver usually implements at least the open, close, read, and write system calls. The text console (/dev/console) and the serial ports (/dev/ttyS0 and friends) are examples of char devices, as they are well represented by the stream abstraction. Char devices are accessed by means of filesystem nodes, such as /dev/tty1 and /dev/lp0. The only relevant difference between a char device and a regular file is that you can always move back and forth in the regular file, whereas most char devices are just data channels, which you can only access sequentially. There exist, nonetheless, char devices that look like data areas, and you can move back and forth in them; for instance, this usually applies to frame grabbers, where the applications can access the whole acquired image using mmap or lseek.
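In code, a char driver publishes these system-call entry points through a `struct file_operations`; the skeleton below (method bodies omitted, names hypothetical) shows the shape:

```c
#include <linux/fs.h>
#include <linux/module.h>

/* Method declarations; the bodies are omitted in this sketch. */
static int     mychar_open(struct inode *inode, struct file *filp);
static int     mychar_release(struct inode *inode, struct file *filp);
static ssize_t mychar_read(struct file *filp, char __user *buf,
                           size_t count, loff_t *f_pos);
static ssize_t mychar_write(struct file *filp, const char __user *buf,
                            size_t count, loff_t *f_pos);

/*
 * The kernel uses this table to dispatch open(), read(), write(), and
 * close() calls on the device node to the driver's methods.
 */
static struct file_operations mychar_fops = {
    .owner   = THIS_MODULE,
    .open    = mychar_open,
    .release = mychar_release,
    .read    = mychar_read,
    .write   = mychar_write,
};
```

Any method left unset in the table gets the kernel's default behavior, which is one way a driver stays small when it needs only part of the interface.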

Block devices

Like char devices, block devices are accessed by filesystem nodes in the /dev directory. A block device is a device (e.g., a disk) that can host a filesystem. In most Unix systems, a block device can only handle I/O operations that transfer one or more whole blocks, which are usually 512 bytes (or a larger power of two) in length. Linux, instead, allows the application to read and write a block device like a char device—it permits the transfer of any number of bytes at a time. As a result, block and char devices differ only in the way data is managed internally by the kernel, and thus in the kernel/driver software interface. Like a char device, each block device is accessed through a filesystem node, and the difference between them is transparent to the user. Block drivers, however, have a completely different interface to the kernel than char drivers.

Network interfaces

Any network transaction is made through an interface, that is, a device that is able to exchange data with other hosts. Usually, an interface is a hardware device, but it might also be a pure software device, like the loopback interface. A network interface is in charge of sending and receiving data packets, driven by the network subsystem of the kernel, without knowing how individual transactions map to the actual packets being transmitted. Many network connections (especially those using TCP) are stream-oriented, but network devices are, usually, designed around the transmission and receipt of packets. A network driver knows nothing about individual connections; it only handles packets.

Not being a stream-oriented device, a network interface isn't easily mapped to a node in the filesystem, as /dev/tty1 is. The Unix way to provide access to interfaces is still by assigning a unique name to them (such as eth0), but that name doesn't have a corresponding entry in the filesystem. Communication between the kernel and a network device driver is completely different from that used with char and block drivers. Instead of read and write, the kernel calls functions related to packet transmission.

There are other ways of classifying driver modules that are orthogonal to the above device types. In general, some types of drivers work with additional layers of kernel support functions for a given type of device. For example, one can talk of universal serial bus (USB) modules, serial modules, SCSI modules, and so on. Every USB device is driven by a USB module that works with the USB subsystem, but the device itself shows up in the system as a char device (a USB serial port, say), a block device (a USB memory card reader), or a network device (a USB Ethernet interface).

Other classes of device drivers have been added to the kernel in recent times, including FireWire drivers and I2O drivers. In the same way that they handled USB and SCSI drivers, kernel developers collected class-wide features and exported them to driver implementers to avoid duplicating work and bugs, thus simplifying and strengthening the process of writing such drivers.

In addition to device drivers, other functionalities, both hardware and software, are modularized in the kernel. One common example is filesystems. A filesystem type determines how information is organized on a block device in order to represent a tree of directories and files. Such an entity is not a device driver, in that there's no explicit device associated with the way the information is laid down; the filesystem type is instead a software driver, because it maps the low-level data structures to high-level data structures. It is the filesystem that determines how long a filename can be and what information about each file is stored in a directory entry. The filesystem module must implement the lowest level of the system calls that access directories and files, by mapping filenames and paths (as well as other information, such as access modes) to data structures stored in data blocks. Such an interface is completely independent of the actual data transfer to and from the disk (or other medium), which is accomplished by a block device driver.

If you think of how strongly a Unix system depends on the underlying filesystem, you'll realize that such a software concept is vital to system operation. The ability to decode filesystem information stays at the lowest level of the kernel hierarchy and is of utmost importance; even if you write a block driver for your new CD-ROM, it is useless if you are not able to run ls or cp on the data it hosts. Linux supports the concept of a filesystem module, whose software interface declares the different operations that can be performed on a filesystem inode, directory, file, and superblock. It's quite unusual for a programmer to actually need to write a filesystem module, because the official kernel already includes code for the most important filesystem types.

Security Issues

Security is an increasingly important concern in modern times. We will discuss security-related issues as they come up throughout the book. There are a few general concepts, however, that are worth mentioning now.

Any security check in the system is enforced by kernel code. If the kernel has security holes, then the system as a whole has holes. In the official kernel distribution, only an authorized user can load modules; the system call init_module checks if the invoking process is authorized to load a module into the kernel. Thus, when running an official kernel, only the superuser,[1] or an intruder who has succeeded in becoming privileged, can exploit the power of privileged code.

When possible, driver writers should avoid encoding security policy in their code. Security is a policy issue that is often best handled at higher levels within the kernel, under the control of the system administrator. There are always exceptions, however. As a device driver writer, you should be aware of situations in which some types of device access could adversely affect the system as a whole and should provide adequate controls. For example, device operations that affect global resources (such as setting an interrupt line), which could damage the hardware (loading firmware, for example), or that could affect other users (such as setting a default block size on a tape drive), are usually only available to sufficiently privileged users, and this check must be made in the driver itself.
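In the 2.6 kernel, such a check is usually written with the capability mechanism rather than a bare user-ID test. A hedged sketch follows; the operation and all names are hypothetical:

```c
#include <linux/capability.h>
#include <linux/errno.h>
#include <linux/sched.h>

/*
 * Hypothetical fragment of a driver method: only a suitably privileged
 * process may push new firmware to the device, and the driver itself
 * must enforce that check.
 */
static int mydev_load_firmware(const void __user *image, size_t len)
{
    if (!capable(CAP_SYS_ADMIN))
        return -EPERM;
    /* ... validate the length, then copy and program the firmware ... */
    return 0;
}
```

Using `capable()` keeps the policy decision (which users hold which capabilities) with the administrator, while the driver only enforces it.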

Driver writers must also be careful, of course, to avoid introducing security bugs. The C programming language makes it easy to make several types of errors. Many current security problems are created, for example, by buffer overrun errors, in which the programmer forgets to check how much data is written to a buffer, and data ends up written beyond the end of the buffer, thus overwriting unrelated data. Such errors can compromise the entire system and must be avoided. Fortunately, avoiding these errors is usually relatively easy in the device driver context, in which the interface to the user is narrowly defined and highly controlled.

Some other general security ideas are worth keeping in mind. Any input received from user processes should be treated with great suspicion; never trust it unless you can verify it. Be careful with uninitialized memory; any memory obtained from the kernel should be zeroed or otherwise initialized before being made available to a user process or device. Otherwise, information leakage (disclosure of data, passwords, etc.) could result. If your device interprets data sent to it, be sure the user cannot send anything that could compromise the system. Finally, think about the possible effect of device operations; if there are specific operations (e.g., reloading the firmware on an adapter board or formatting a disk) that could affect the system, those operations should almost certainly be restricted to privileged users.
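Both habits, distrusting user-supplied lengths and never handing out uninitialized kernel memory, translate directly into driver code. A sketch with a hypothetical size limit and hypothetical names:

```c
#include <linux/slab.h>
#include <linux/string.h>
#include <asm/uaccess.h>

#define MYDEV_MAX_CMD 64 /* hypothetical upper bound on a command */

static int mydev_handle_command(const char __user *ubuf, size_t count)
{
    char *kbuf;

    /* Never trust a user-supplied length: reject anything out of range. */
    if (count == 0 || count > MYDEV_MAX_CMD)
        return -EINVAL;

    kbuf = kmalloc(MYDEV_MAX_CMD, GFP_KERNEL);
    if (!kbuf)
        return -ENOMEM;
    /* Zero the buffer so stale kernel data can never leak back out. */
    memset(kbuf, 0, MYDEV_MAX_CMD);

    if (copy_from_user(kbuf, ubuf, count)) {
        kfree(kbuf);
        return -EFAULT;
    }

    /* ... interpret the command, validating every field it contains ... */

    kfree(kbuf);
    return 0;
}
```

The bounds check before the allocation is what closes the buffer-overrun class of bugs; the `memset` closes the information-leak class.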

Be careful, also, when receiving software from third parties, especially when the kernel is concerned: because everybody has access to the source code, everybody can break and recompile things. Although you can usually trust precompiled kernels found in your distribution, you should avoid running kernels compiled by an untrusted friend—if you wouldn't run a precompiled binary as root, then you'd better not run a precompiled kernel. For example, a maliciously modified kernel could allow anyone to load a module, thus opening an unexpected back door via init_module.

Note that the Linux kernel can be compiled to have no module support whatsoever, thus closing any module-related security holes. In this case, of course, all needed drivers must be built directly into the kernel itself. It is also possible, with 2.2 and later kernels, to disable the loading of kernel modules after system boot via the capability mechanism.

Version Numbering

Before digging into programming, we should comment on the version numbering scheme used in Linux and which versions are covered by this book.

First of all, note that every software package used in a Linux system has its own release number, and there are often interdependencies across them: you need a particular version of one package to run a particular version of another package. The creators of Linux distributions usually handle the messy problem of matching packages, and the user who installs from a prepackaged distribution doesn't need to deal with version numbers. Those who replace and upgrade system software, on the other hand, are on their own in this regard. Fortunately, almost all modern distributions support the upgrade of single packages by checking interpackage dependencies; the distribution's package manager generally does not allow an upgrade until the dependencies are satisfied.

To run the examples we introduce during the discussion, you won't need particular versions of any tool beyond what the 2.6 kernel requires; any recent Linux distribution can be used to run our examples. We won't detail specific requirements, because the file Documentation/Changes in your kernel sources is the best source of such information if you experience any problems.

As far as the kernel is concerned, the even-numbered kernel versions (i.e., 2.6.x) are the stable ones that are intended for general distribution. The odd versions (such as 2.7.x), on the contrary, are development snapshots and are quite ephemeral; the latest of them represents the current status of development, but becomes obsolete in a few days or so.

This book covers Version 2.6 of the kernel. Our focus has been to show all the features available to device driver writers in 2.6.10, the current version at the time we are writing. This edition of the book does not cover prior versions of the kernel. For those of you who are interested, the second edition covered Versions 2.0 through 2.4 in detail. That edition is still available online at http://lwn.net/Kernel/LDD2/.

Kernel programmers should be aware that the development process changed with 2.6. The 2.6 series is now accepting changes that previously would have been considered too large for a "stable" kernel. Among other things, that means that internal kernel programming interfaces can change, thus potentially obsoleting parts of this book; for this reason, the sample code accompanying the text is known to work with 2.6.10, but some modules don't compile under earlier versions. Programmers wanting to keep up with kernel programming changes are encouraged to join the mailing lists and to make use of the web sites listed in the bibliography. There is also a web page maintained at http://lwn.net/Articles/2.6-kernel-api/, which contains information about API changes that have happened since this book was published.

This text doesn't talk specifically about odd-numbered kernel versions. General users never have a reason to run development kernels. Developers experimenting with new features, however, want to be running the latest development release. They usually keep upgrading to the most recent version to pick up bug fixes and new implementations of features. Note, however, that there's no guarantee on experimental kernels,[2] and nobody helps you if you have problems due to a bug in a noncurrent odd-numbered kernel. Those who run odd-numbered versions of the kernel are usually skilled enough to dig in the code without the need for a textbook, which is another reason why we don't talk about development kernels here.

Another feature of Linux is that it is a platform-independent operating system, not just "a Unix clone for PC clones" anymore: it currently supports some 20 architectures. This book is platform independent as far as possible, and all the code samples have been tested on at least the x86 and x86-64 platforms. Because the code has been tested on both 32-bit and 64-bit processors, it should compile and run on all other platforms. As you might expect, the code samples that rely on particular hardware don't work on all the supported platforms, but this is always stated in the source code.

License Terms

Linux is licensed under Version 2 of the GNU General Public License (GPL), a document devised for the GNU project by the Free Software Foundation. The GPL allows anybody to redistribute, and even sell, a product covered by the GPL, as long as the recipient has access to the source and is able to exercise the same rights. Additionally, any software product derived from a product covered by the GPL must, if it is redistributed at all, be released under the GPL.

The main goal of such a license is to allow the growth of knowledge by permitting everybody to modify programs at will; at the same time, people selling software to the public can still do their job. Despite this simple objective, there's a never-ending discussion about the GPL and its use. If you want to read the license, you can find it in several places in your system, including the top directory of your kernel source tree in the COPYING file.

Vendors often ask whether they can distribute kernel modules in binary form only. The answer to that question has been deliberately left ambiguous. Distribution of binary modules—as long as they adhere to the published kernel interface—has been tolerated so far. But the copyrights on the kernel are held by many developers, and not all of them agree that kernel modules are not derived products. If you or your employer wish to distribute kernel modules under a nonfree license, you really need to discuss the situation with your legal counsel. Please note also that the kernel developers have no qualms against breaking binary modules between kernel releases, even in the middle of a stable kernel series. If it is at all possible, both you and your users are better off if you release your module as free software.

If you want your code to go into the mainline kernel, or if your code requires patches to the kernel, you must use a GPL-compatible license as soon as you release the code. Although personal use of your changes doesn't force the GPL on you, if you distribute your code, you must include the source code in the distribution—people acquiring your package must be allowed to rebuild the binary at will.

As far as this book is concerned, most of the code is freely redistributable, either in source or binary form, and neither we nor O'Reilly retain any right on any derived works. All the programs are available at ftp://ftp.ora.com/pub/examples/linux/drivers/, and the exact license terms are stated in the LICENSE file in the same directory.

Joining the Kernel Development Community

As you begin writing modules for the Linux kernel, you become part of a larger community of developers. Within that community, you can find not only people engaged in similar work, but also a group of highly committed engineers working toward making Linux a better system. These people can be a source of help, ideas, and critical review as well—they will be the first people you will likely turn to when you are looking for testers for a new driver.

The central gathering point for Linux kernel developers is the linux-kernel mailing list. All major kernel developers, from Linus Torvalds on down, subscribe to this list. Please note that the list is not for the faint of heart: traffic as of this writing can run up to 200 messages per day or more. Nonetheless, following this list is essential for those who are interested in kernel development; it also can be a top-quality resource for those in need of kernel development help.

To join the linux-kernel list, follow the instructions found in the linux-kernel mailing list FAQ: http://www.tux.org/lkml. Read the rest of the FAQ while you are at it; there is a great deal of useful information there. Linux kernel developers are busy people, and they are much more inclined to help people who have clearly done their homework first.

Overview of the Book

From here on, we enter the world of kernel programming. Chapter 2 introduces modularization, explaining the secrets of the art and showing the code for running modules. Chapter 3 talks about char drivers and shows the complete code for a memory-based device driver that can be read and written for fun. Using memory as the hardware base for the device allows anyone to run the sample code without the need to acquire special hardware.

Debugging techniques are vital tools for the programmer and are introduced in Chapter 4. Equally important for those who would hack on contemporary kernels is the management of concurrency and race conditions. Chapter 5 concerns itself with the problems posed by concurrent access to resources and introduces the Linux mechanisms for controlling concurrency.

With debugging and concurrency management skills in place, we move to advanced features of char drivers, such as blocking operations, the use of select, and the important ioctl call; these topics are the subject of Chapter 6.

Before dealing with hardware management, we dissect a few more of the kernel's software interfaces: Chapter 7 shows how time is managed in the kernel, and Chapter 8 explains memory allocation.

Next we focus on hardware. Chapter 9 describes the management of I/O ports and memory buffers that live on the device; after that comes interrupt handling, in Chapter 10. Unfortunately, not everyone is able to run the sample code for these chapters, because some hardware support is actually needed to test the software interface interrupts. We've tried our best to keep required hardware support to a minimum, but you still need some simple hardware, such as a standard parallel port, to work with the sample code for these chapters.

Chapter 11 covers the use of data types in the kernel and the writing of portable code.

The second half of the book is dedicated to more advanced topics. We start by getting deeper into the hardware and, in particular, the functioning of specific peripheral buses. Chapter 12 covers the details of writing drivers for PCI devices, and Chapter 13 examines the API for working with USB devices.

With an understanding of peripheral buses in place, we can take a detailed look at the Linux device model, which is the abstraction layer used by the kernel to describe the hardware and software resources it is managing. Chapter 14 is a bottom-up look at the device model infrastructure, starting with the kobject type and working up from there. It covers the integration of the device model with real hardware; it then uses that knowledge to cover topics like hot-pluggable devices and power management.

In Chapter 15, we take a diversion into Linux memory management. This chapter shows how to map kernel memory into user space (the mmap system call), map user memory into kernel space (with get_user_pages), and how to map either kind of memory into device space (to perform direct memory access [DMA] operations).

Our understanding of memory will be useful for the following two chapters, which cover the other major driver classes. Chapter 16 introduces block drivers and shows how they are different from the char drivers we have worked with so far. Then Chapter 17 gets into the writing of network drivers. We finish up with a discussion of serial drivers and a bibliography.

[1] Technically, only somebody with the CAP_SYS_MODULE capability can perform this operation. We discuss capabilities in Chapter 6.

[2] Note that there's no guarantee on even-numbered kernels as well, unless you rely on a commercial provider that grants its own warranty.

Chapter 2. Building and Running Modules

It's almost time to begin programming. This chapter introduces all the essential concepts about modules and kernel programming. In these few pages, we build and run a complete (if relatively useless) module, and look at some of the basic code shared by all modules. Developing such expertise is an essential foundation for any kind of modularized driver. To avoid throwing in too many concepts at once, this chapter talks only about modules, without referring to any specific device class.

All the kernel items (functions, variables, header files, and macros) that are introduced here are described in a reference section at the end of the chapter.

Setting Up Your Test System

Starting with this chapter, we present example modules to demonstrate programming concepts. (All of these examples are available on O'Reilly's FTP site, as explained in Chapter 1.) Building, loading, and modifying these examples are a good way to improve your understanding of how drivers work and interact with the kernel.

The example modules should work with almost any 2.6.x kernel, including those provided by distribution vendors. However, we recommend that you obtain a "mainline" kernel directly from the kernel.org mirror network, and install it on your system. Vendor kernels can be heavily patched and divergent from the mainline; at times, vendor patches can change the kernel API as seen by device drivers. If you are writing a driver that must work on a particular distribution, you will certainly want to build and test against the relevant kernels. But, for the purpose of learning about driver writing, a standard kernel is best.

Regardless of the origin of your kernel, building modules for 2.6.x requires that you have a configured and built kernel tree on your system. This requirement is a change from previous versions of the kernel, where a current set of header files was sufficient. 2.6 modules are linked against object files found in the kernel source tree; the result is a more robust module loader, but also the requirement that those object files be available. So your first order of business is to come up with a kernel source tree (either from the kernel.org network or your distributor's kernel source package), build a new kernel, and install it on your system. For reasons we'll see later, life is generally easiest if you are actually running the target kernel when you build your modules, though this is not required.

Warning

You should also give some thought to where you do your module experimentation, development, and testing. We have done our best to make our example modules safe and correct, but the possibility of bugs is always present. Faults in kernel code can bring about the demise of a user process or, occasionally, the entire system. They do not normally create more serious problems, such as disk corruption. Nonetheless, it is advisable to do your kernel experimentation on a system that does not contain data that you cannot afford to lose, and that does not perform essential services. Kernel hackers typically keep a "sacrificial" system around for the purpose of testing new code.

So, if you do not yet have a suitable system with a configured and built kernel source tree on disk, now would be a good time to set that up. We'll wait. Once that task is taken care of, you'll be ready to start playing with kernel modules.

The Hello World Module

Many programming books begin with a "hello world" example as a way of showing the simplest possible program. This book deals in kernel modules rather than programs; so, for the impatient reader, the following code is a complete "hello world" module:

#include <linux/init.h>
#include <linux/module.h>
MODULE_LICENSE("Dual BSD/GPL");

static int hello_init(void)
{
    printk(KERN_ALERT "Hello, world\n");
    return 0;
}

static void hello_exit(void)
{
    printk(KERN_ALERT "Goodbye, cruel world\n");
}

module_init(hello_init);
module_exit(hello_exit);

This module defines two functions, one to be invoked when the module is loaded into the kernel (hello_init) and one for when the module is removed (hello_exit). The module_init and module_exit lines use special kernel macros to indicate the role of these two functions. Another special macro (MODULE_LICENSE) is used to tell the kernel that this module bears a free license; without such a declaration, the kernel complains when the module is loaded.

The printk function is defined in the Linux kernel and made available to modules; it behaves similarly to the standard C library function printf. The kernel needs its own printing function because it runs by itself, without the help of the C library. The module can call printk because, after insmod has loaded it, the module is linked to the kernel and can access the kernel's public symbols (functions and variables, as detailed in the next section). The string KERN_ALERT is the priority of the message.[1] We've specified a high priority in this module, because a message with the default priority might not show up anywhere useful, depending on the kernel version you are running, the version of the klogd daemon, and your configuration. You can ignore this issue for now; we explain it in Chapter 4.

You can test the module with the insmod and rmmod utilities, as shown below. Note that only the superuser can load and unload a module.

% make
make[1]: Entering directory `/usr/src/linux-2.6.10'
  CC [M]  /home/ldd3/src/misc-modules/hello.o
  Building modules, stage 2.
  MODPOST
  CC      /home/ldd3/src/misc-modules/hello.mod.o
  LD [M]  /home/ldd3/src/misc-modules/hello.ko
make[1]: Leaving directory `/usr/src/linux-2.6.10'
% su
root# insmod ./hello.ko
Hello, world
root# rmmod hello
Goodbye cruel world
root#

Please note once again that, for the above sequence of commands to work, you must have a properly configured and built kernel tree in a place where the makefile is able to find it (/usr/src/linux-2.6.10 in the example shown). We get into the details of how modules are built in Section 2.4.

According to the mechanism your system uses to deliver the message lines, your output may be different. In particular, the previous screen dump was taken from a text console; if you are running insmod and rmmod from a terminal emulator running under the window system, you won't see anything on your screen. The message goes to one of the system log files, such as /var/log/messages (the name of the actual file varies between Linux distributions). The mechanism used to deliver kernel messages is described in Chapter 4.

As you can see, writing a module is not as difficult as you might expect—at least, as long as the module is not required to do anything worthwhile. The hard part is understanding your device and how to maximize performance. We go deeper into modularization throughout this chapter and leave device-specific issues for later chapters.

Kernel Modules Versus Applications

Before we go further, it's worth underlining the various differences between a kernel module and an application.

While most small and medium-sized applications perform a single task from beginning to end, every kernel module just registers itself in order to serve future requests, and its initialization function terminates immediately. In other words, the task of the module's initialization function is to prepare for later invocation of the module's functions; it's as though the module were saying, "Here I am, and this is what I can do." The module's exit function (hello_exit in the example) gets invoked just before the module is unloaded. It should tell the kernel, "I'm not there anymore; don't ask me to do anything else." This kind of approach to programming is similar to event-driven programming, but while not all applications are event-driven, each and every kernel module is. Another major difference between event-driven applications and kernel code is in the exit function: whereas an application that terminates can be lazy in releasing resources or avoids clean up altogether, the exit function of a module must carefully undo everything the init function built up, or the pieces remain around until the system is rebooted.

Incidentally, the ability to unload a module is one of the features of modularization that you'll most appreciate, because it helps cut down development time; you can test successive versions of your new driver without going through the lengthy shutdown/reboot cycle each time.

As a programmer, you know that an application can call functions it doesn't define: the linking stage resolves external references using the appropriate library of functions. printf is one of those callable functions and is defined in libc. A module, on the other hand, is linked only to the kernel, and the only functions it can call are the ones exported by the kernel; there are no libraries to link to. The printk function used in hello.c earlier, for example, is the version of printf defined within the kernel and exported to modules. It behaves similarly to the original function, with a few minor differences, the main one being lack of floating-point support.

Figure 2-1 shows how function calls and function pointers are used in a module to add new functionality to a running kernel.

Figure 2-1. Linking a module to the kernel

Because no library is linked to modules, source files should never include the usual header files; <stdarg.h> and very special situations are the only exceptions. Only functions that are actually part of the kernel itself may be used in kernel modules. Anything related to the kernel is declared in headers found in the kernel source tree you have set up and configured; most of the relevant headers live in include/linux and include/asm, but other subdirectories of include have been added to host material associated with specific kernel subsystems.

The role of individual kernel headers is introduced throughout the book as each of them is needed.

Another important difference between kernel programming and application programming is in how each environment handles faults: whereas a segmentation fault is harmless during application development and a debugger can always be used to trace the error to the problem in the source code, a kernel fault kills the current process at least, if not the whole system. We see how to trace kernel errors in Chapter 4.

User Space and Kernel Space

A module runs in kernel space, whereas applications run in user space. This concept is at the base of operating systems theory.

The role of the operating system, in practice, is to provide programs with a consistent view of the computer's hardware. In addition, the operating system must account for independent operation of programs and protection against unauthorized access to resources. This nontrivial task is possible only if the CPU enforces protection of system software from the applications.

Every modern processor is able to enforce this behavior. The chosen approach is to implement different operating modalities (or levels) in the CPU itself. The levels have different roles, and some operations are disallowed at the lower levels; program code can switch from one level to another only through a limited number of gates. Unix systems are designed to take advantage of this hardware feature, using two such levels. All current processors have at least two protection levels, and some, like the x86 family, have more levels; when several levels exist, the highest and lowest levels are used. Under Unix, the kernel executes in the highest level (also called supervisor mode ), where everything is allowed, whereas applications execute in the lowest level (the so-called user mode ), where the processor regulates direct access to hardware and unauthorized access to memory.

We usually refer to the execution modes as kernel space and user space. These terms encompass not only the different privilege levels inherent in the two modes, but also the fact that each mode can have its own memory mapping—its own address space—as well.

Unix transfers execution from user space to kernel space whenever an application issues a system call or is suspended by a hardware interrupt. Kernel code executing a system call is working in the context of a process—it operates on behalf of the calling process and is able to access data in the process's address space. Code that handles interrupts, on the other hand, is asynchronous with respect to processes and is not related to any particular process.

The role of a module is to extend kernel functionality; modularized code runs in kernel space. Usually a driver performs both the tasks outlined previously: some functions in the module are executed as part of system calls, and some are in charge of interrupt handling.

Concurrency in the Kernel

One way in which kernel programming differs greatly from conventional application programming is the issue of concurrency. Most applications, with the notable exception of multithreading applications, typically run sequentially, from the beginning to the end, without any need to worry about what else might be happening to change their environment. Kernel code does not run in such a simple world, and even the simplest kernel modules must be written with the idea that many things can be happening at once.

There are a few sources of concurrency in kernel programming. Naturally, Linux systems run multiple processes, more than one of which can be trying to use your driver at the same time. Most devices are capable of interrupting the processor; interrupt handlers run asynchronously and can be invoked at the same time that your driver is trying to do something else. Several software abstractions (such as kernel timers, introduced in Chapter 7) run asynchronously as well. Moreover, of course, Linux can run on symmetric multiprocessor (SMP) systems, with the result that your driver could be executing concurrently on more than one CPU. Finally, in 2.6, kernel code has been made preemptible; this change causes even uniprocessor systems to have many of the same concurrency issues as multiprocessor systems.

As a result, Linux kernel code, including driver code, must be reentrant —it must be capable of running in more than one context at the same time. Data structures must be carefully designed to keep multiple threads of execution separate, and the code must take care to access shared data in ways that prevent corruption of the data. Writing code that handles concurrency and avoids race conditions (situations in which an unfortunate order of execution causes undesirable behavior) requires thought and can be tricky. Proper management of concurrency is required to write correct kernel code; for that reason, every sample driver in this book has been written with concurrency in mind. The techniques used are explained as we come to them; Chapter 5 has also been dedicated to this issue and the kernel primitives available for concurrency management.

A common mistake made by driver programmers is to assume that concurrency is not a problem as long as a particular segment of code does not go to sleep (or "block"). Even in previous kernels (which were not preemptive), this assumption was not valid on multiprocessor systems. In 2.6, kernel code can (almost) never assume that it can hold the processor over a given stretch of code. If you do not write your code with concurrency in mind, it will be subject to catastrophic failures that can be exceedingly difficult to debug.

The Current Process

Although kernel modules don't execute sequentially as applications do, most actions performed by the kernel are done on behalf of a specific process. Kernel code can refer to the current process by accessing the global item current, defined in <asm/current.h>, which yields a pointer to struct task_struct, defined by <linux/sched.h>. The current pointer refers to the process that is currently executing. During the execution of a system call, such as open or read, the current process is the one that invoked the call. Kernel code can use process-specific information by using current, if it needs to do so. An example of this technique is presented in Chapter 6.

Actually, current is not truly a global variable. The need to support SMP systems forced the kernel developers to develop a mechanism that finds the current process on the relevant CPU. This mechanism must also be fast, since references to current happen frequently. The result is an architecture-dependent mechanism that, usually, hides a pointer to the task_struct structure on the kernel stack. The details of the implementation remain hidden to other kernel subsystems though, and a device driver can just include <linux/sched.h> and refer to the current process. For example, the following statement prints the process ID and the command name of the current process by accessing certain fields in struct task_struct:

printk(KERN_INFO "The process is \"%s\" (pid %i)\n",
        current->comm, current->pid);

The command name stored in current->comm is the base name of the program file (trimmed to 15 characters if need be) that is being executed by the current process.

A Few Other Details

Kernel programming differs from user-space programming in many ways. We'll point things out as we get to them over the course of the book, but there are a few fundamental issues which, while not warranting a section of their own, are worth a mention. So, as you dig into the kernel, the following issues should be kept in mind.

Applications are laid out in virtual memory with a very large stack area. The stack, of course, is used to hold the function call history and all automatic variables created by currently active functions. The kernel, instead, has a very small stack; it can be as small as a single, 4096-byte page. Your functions must share that stack with the entire kernel-space call chain. Thus, it is never a good idea to declare large automatic variables; if you need larger structures, you should allocate them dynamically at call time.

Often, as you look at the kernel API, you will encounter function names starting with a double underscore (__). Functions so marked are generally a low-level component of the interface and should be used with caution. Essentially, the double underscore says to the programmer: "If you call this function, be sure you know what you are doing."

Kernel code cannot do floating point arithmetic. Enabling floating point would require that the kernel save and restore the floating point processor's state on each entry to, and exit from, kernel space—at least, on some architectures. Given that there really is no need for floating point in kernel code, the extra overhead is not worthwhile.

编译和加载

Compiling and Loading

本章开头的“hello world”示例包括构建模块并将其加载到系统中的简短演示。当然,整个过程的内容比我们迄今为止看到的要多得多。本节提供有关模块作者如何将源代码转换为内核中的执行子系统的更多详细信息。

The "hello world" example at the beginning of this chapter included a brief demonstration of building a module and loading it into the system. There is, of course, a lot more to that whole process than we have seen so far. This section provides more detail on how a module author turns source code into an executing subsystem within the kernel.

编译模块

Compiling Modules

作为第一步,我们需要看看模块是如何构建的。模块的构建过程与用户空间应用程序的构建过程有很大不同;内核是一个大型的独立程序,对其各部分的组合方式有详细而明确的要求。构建过程也不同于以前内核版本的构建过程;新的构建系统使用起来更简单,并产生更正确的结果,但它看起来与以前的非常不同。内核构建系统是一个复杂的庞然大物,我们只关注它的一小部分。任何想要了解表面之下真正发生的事情的人,都应该阅读内核源代码中 Documentation/kbuild 目录下的文件。

As the first step, we need to look a bit at how modules must be built. The build process for modules differs significantly from that used for user-space applications; the kernel is a large, standalone program with detailed and explicit requirements on how its pieces are put together. The build process also differs from how things were done with previous versions of the kernel; the new build system is simpler to use and produces more correct results, but it looks very different from what came before. The kernel build system is a complex beast, and we just look at a tiny piece of it. The files found in the Documentation/kbuild directory in the kernel source are required reading for anybody wanting to understand all that is really going on beneath the surface.

在构建内核模块之前,您必须先解决一些先决条件。首先是确保您拥有足够新版本的编译器、模块实用程序和其他必要工具。内核文档目录中的 Documentation/Changes 文件始终列出所需的工具版本;在继续之前您应该先查阅它。尝试使用错误的工具版本构建内核(及其模块)可能会导致层出不穷的微妙而棘手的问题。请注意,有时太新的编译器版本可能与太旧的版本一样有问题;内核源代码对编译器做了很多假设,新版本有时会暂时破坏一些东西。

There are some prerequisites that you must get out of the way before you can build kernel modules. The first is to ensure that you have sufficiently current versions of the compiler, module utilities, and other necessary tools. The file Documentation/Changes in the kernel documentation directory always lists the required tool versions; you should consult it before going any further. Trying to build a kernel (and its modules) with the wrong tool versions can lead to no end of subtle, difficult problems. Note that, occasionally, a version of the compiler that is too new can be just as problematic as one that is too old; the kernel source makes a great many assumptions about the compiler, and new releases can sometimes break things for a while.

如果您仍然没有方便的内核树,或者尚未配置和构建该内核,那么现在是时候去做了。如果文件系统上没有此树,则无法为 2.6 内核构建可加载模块。实际运行您正在构建的内核也很有帮助(尽管不是必需的)。

If you still do not have a kernel tree handy, or have not yet configured and built that kernel, now is the time to go do it. You cannot build loadable modules for a 2.6 kernel without this tree on your filesystem. It is also helpful (though not required) to be actually running the kernel that you are building for.

一旦完成所有设置,为模块创建 makefile 就很简单了。事实上,对于本章前面显示的“hello world”示例,一行就足够了:

Once you have everything set up, creating a makefile for your module is straightforward. In fact, for the "hello world" example shown earlier in this chapter, a single line will suffice:

obj-m := hello.o
obj-m := hello.o

熟悉make但不熟悉 2.6 内核构建系统的读者可能想知道这个 makefile 是如何工作的。毕竟,上面的行并不是传统 makefile 的样子。答案当然是内核构建系统会处理其余的事情。上面的赋值(利用了 GNU make提供的扩展语法)表明有一个模块需要从目标文件hello.o构建。从目标文件构建后生成的模块被命名为hello.ko 。

Readers who are familiar with make, but not with the 2.6 kernel build system, are likely to be wondering how this makefile works. The above line is not how a traditional makefile looks, after all. The answer, of course, is that the kernel build system handles the rest. The assignment above (which takes advantage of the extended syntax provided by GNU make) states that there is one module to be built from the object file hello.o. The resulting module is named hello.ko after being built from the object file.

相反,如果您有一个名为module.ko的模块 ,它是从两个源文件(例如file1.cfile2.c)生成的,则正确的咒语是:

If, instead, you have a module called module.ko that is generated from two source files (called, say, file1.c and file2.c), the correct incantation would be:

obj-m := module.o
module-objs := file1.o file2.o
obj-m := module.o
module-objs := file1.o file2.o

为了使上面显示的 makefile 能够工作,必须在更大的内核构建系统的上下文中调用它。如果您的内核源代码树位于 ~/kernel-2.6目录中, 则make 构建模块所需的命令(在包含模块源代码和 makefile 的目录中键入)将是:

For a makefile like those shown above to work, it must be invoked within the context of the larger kernel build system. If your kernel source tree is located in, say, your ~/kernel-2.6 directory, the make command required to build your module (typed in the directory containing the module source and makefile) would be:

make -C ~/kernel-2.6 M=`pwd` modules
make -C ~/kernel-2.6 M=`pwd` modules

该命令首先将工作目录切换到 -C 选项指定的目录(即您的内核源目录),并在那里找到内核的顶级 makefile。M= 选项使该 makefile 在尝试构建 modules 目标之前返回到您的模块源目录。该目标进而引用 obj-m 变量中列出的模块,在我们的示例中该变量被设置为 module.o。

This command starts by changing its directory to the one provided with the -C option (that is, your kernel source directory). There it finds the kernel's top-level makefile. The M= option causes that makefile to move back into your module source directory before trying to build the modules target. This target, in turn, refers to the list of modules found in the obj-m variable, which we've set to module.o in our examples.

键入前面的 make 命令一段时间后可能会变得令人厌烦,因此内核开发人员开发了一种 makefile 惯用法,使在内核树之外构建模块的人的日子更好过。诀窍是按如下方式编写你的 makefile:

Typing the previous make command can get tiresome after a while, so the kernel developers have developed a sort of makefile idiom, which makes life easier for those building modules outside of the kernel tree. The trick is to write your makefile as follows:

# 如果定义了 KERNELRELEASE,说明我们是从
# 内核构建系统调用的,可以使用它的语言。
ifneq ($(KERNELRELEASE),)
    obj-m := hello.o

# 否则,我们是直接从命令行调用的;
# 此时要调用内核构建系统。
else

    KERNELDIR ?= /lib/modules/$(shell uname -r)/build
    PWD := $(shell pwd)

default:
    $(MAKE) -C $(KERNELDIR) M=$(PWD) modules

endif
# If KERNELRELEASE is defined, we've been invoked from the
# kernel build system and can use its language.
ifneq ($(KERNELRELEASE),)
    obj-m := hello.o 

# Otherwise we were called directly from the command
# line; invoke the kernel build system.
else

    KERNELDIR ?= /lib/modules/$(shell uname -r)/build
    PWD  := $(shell pwd)

default:
    $(MAKE) -C $(KERNELDIR) M=$(PWD) modules

endif

我们再次看到扩展的 GNU make 语法的实际应用。在典型的构建中,该 makefile 会被读取两次。当从命令行调用该 makefile 时,它会注意到 KERNELRELEASE 变量尚未设置。它利用已安装模块目录中名为 build 的符号链接指向内核构建树这一事实来定位内核源目录。如果您实际运行的并不是正在为其构建模块的内核,则可以在命令行上提供 KERNELDIR= 选项、设置 KERNELDIR 环境变量,或重写 makefile 中设置 KERNELDIR 的那一行。一旦找到内核源代码树,makefile 就会调用 default: 目标,它运行第二个 make 命令(在 makefile 中参数化为 $(MAKE))来按前述方式调用内核构建系统。在第二次读取时,makefile 设置了 obj-m,由内核的 makefile 负责实际构建模块。

Once again, we are seeing the extended GNU make syntax in action. This makefile is read twice on a typical build. When the makefile is invoked from the command line, it notices that the KERNELRELEASE variable has not been set. It locates the kernel source directory by taking advantage of the fact that the symbolic link build in the installed modules directory points back at the kernel build tree. If you are not actually running the kernel that you are building for, you can supply a KERNELDIR= option on the command line, set the KERNELDIR environment variable, or rewrite the line that sets KERNELDIR in the makefile. Once the kernel source tree has been found, the makefile invokes the default: target, which runs a second make command (parameterized in the makefile as $(MAKE)) to invoke the kernel build system as described previously. On the second reading, the makefile sets obj-m, and the kernel makefiles take care of actually building the module.

这种构建模块的机制可能会让您觉得有点笨拙和晦涩。然而,一旦您习惯了它,您可能会欣赏内核构建系统中内置的种种能力。请注意,上面的内容并不是完整的 makefile;真正的 makefile 还包括常见的目标,用于清理不需要的文件、安装模块等。有关完整示例,请参阅示例源目录中的 makefile。

This mechanism for building modules may strike you as a bit unwieldy and obscure. Once you get used to it, however, you will likely appreciate the capabilities that have been programmed into the kernel build system. Do note that the above is not a complete makefile; a real makefile includes the usual sort of targets for cleaning up unneeded files, installing modules, etc. See the makefiles in the example source directory for a complete example.

加载和卸载模块

Loading and Unloading Modules

模块构建完成后,下一步是将其加载到内核中。正如我们已经指出的,insmod 可以为您完成这项工作。该程序将模块代码和数据加载到内核中,内核随后执行类似于 ld 的功能,将模块中所有未解析的符号链接到内核的符号表。然而,与链接器不同,内核不会修改模块的磁盘文件,而只修改内存中的副本。insmod 接受许多命令行选项(详情请参阅联机帮助页),并且它可以在将模块链接到当前内核之前为模块中的参数赋值。因此,如果模块设计正确,就可以在加载时对其进行配置;加载时配置为用户提供了比编译时配置更大的灵活性,而编译时配置有时仍在使用。加载时配置将在本章后面的 2.8 节中解释。

After the module is built, the next step is loading it into the kernel. As we've already pointed out, insmod does the job for you. The program loads the module code and data into the kernel, which, in turn, performs a function similar to that of ld, in that it links any unresolved symbol in the module to the symbol table of the kernel. Unlike the linker, however, the kernel doesn't modify the module's disk file, but rather an in-memory copy. insmod accepts a number of command-line options (for details, see the manpage), and it can assign values to parameters in your module before linking it to the current kernel. Thus, if a module is correctly designed, it can be configured at load time; load-time configuration gives the user more flexibility than compile-time configuration, which is still used sometimes. Load-time configuration is explained in Section 2.8 later in this chapter.

有兴趣的读者可能想看看内核是如何支持 insmod 的:它依赖于 kernel/module.c 中定义的一个系统调用。函数 sys_init_module 分配内核内存来保存模块(该内存是用 vmalloc 分配的;请参阅第 8 章中的 8.4 节);然后,它将模块文本复制到该内存区域,通过内核符号表解析模块中的内核引用,并调用模块的初始化函数,让一切运转起来。

Interested readers may want to look at how the kernel supports insmod: it relies on a system call defined in kernel/module.c. The function sys_init_module allocates kernel memory to hold a module (this memory is allocated with vmalloc; see Section 8.4 in Chapter 8); it then copies the module text into that memory region, resolves kernel references in the module via the kernel symbol table, and calls the module's initialization function to get everything going.

如果您实际查看内核源代码,您会发现系统调用的名称都以 sys_ 为前缀。所有系统调用都是如此,而其他函数则不然;在源代码中用 grep 查找系统调用时,记住这一点很有用。

If you actually look in the kernel source, you'll find that the names of the system calls are prefixed with sys_. This is true for all system calls and no other functions; it's useful to keep this in mind when grepping for the system calls in the sources.

modprobe 实用程序值得一提。modprobe 与 insmod 类似,也将模块加载到内核中。它的不同之处在于,它会检查要加载的模块,看它是否引用了当前内核中尚未定义的任何符号。如果找到任何此类引用,modprobe 会在当前模块搜索路径中查找定义相关符号的其他模块。当 modprobe 找到这些模块(正在加载的模块需要它们)时,也会将它们一并加载到内核中。如果在这种情况下改用 insmod,该命令会失败,并在系统日志文件中留下"未解析的符号"消息。

The modprobe utility is worth a quick mention. modprobe, like insmod, loads a module into the kernel. It differs in that it will look at the module to be loaded to see whether it references any symbols that are not currently defined in the kernel. If any such references are found, modprobe looks for other modules in the current module search path that define the relevant symbols. When modprobe finds those modules (which are needed by the module being loaded), it loads them into the kernel as well. If you use insmod in this situation instead, the command fails with an "unresolved symbols" message left in the system logfile.

如前所述,可以使用 rmmod 实用程序从内核中删除模块。请注意,如果内核认为该模块仍在使用中(例如,某个程序仍打开着该模块导出的设备文件),或者内核已配置为不允许删除模块,则模块删除会失败。可以将内核配置为允许"强制"删除模块,即使它们看起来正忙。然而,如果您已经到了考虑使用此选项的地步,事情很可能已经糟糕到重新启动才是更好的做法。

As mentioned before, modules may be removed from the kernel with the rmmod utility. Note that module removal fails if the kernel believes that the module is still in use (e.g., a program still has an open file for a device exported by the module), or if the kernel has been configured to disallow module removal. It is possible to configure the kernel to allow "forced" removal of modules, even when they appear to be busy. If you reach a point where you are considering using this option, however, things are likely to have gone wrong badly enough that a reboot may well be the better course of action.

lsmod程序生成当前加载到内核中的模块的列表。还提供了一些其他信息,例如使用特定模块的任何其他模块。lsmod通过读取/proc/modules虚拟文件来工作。有关当前加载模块的信息也可以在/sys/module下的 sysfs 虚拟文件系统中找到。

The lsmod program produces a list of the modules currently loaded in the kernel. Some other information, such as any other modules making use of a specific module, is also provided. lsmod works by reading the /proc/modules virtual file. Information on currently loaded modules can also be found in the sysfs virtual filesystem under /sys/module.

版本依赖

Version Dependency

请记住,您的模块代码必须针对它所链接的每个内核版本重新编译(至少在不使用 modversions 的情况下如此;这里不介绍 modversions,因为它们更多地面向发行版制作者而不是开发人员)。模块与特定内核版本中定义的数据结构和函数原型紧密相关;模块所看到的接口可能会从一个内核版本到下一个版本发生显著变化。当然,对于开发版内核来说尤其如此。

Bear in mind that your module's code has to be recompiled for each version of the kernel that it is linked to—at least, in the absence of modversions, not covered here as they are more for distribution makers than developers. Modules are strongly tied to the data structures and function prototypes defined in a particular kernel version; the interface seen by a module can change significantly from one kernel version to the next. This is especially true of development kernels, of course.

内核不仅仅假设给定的模块是根据正确的内核版本构建的。构建过程中的步骤之一是将模块链接到当前内核树中的文件(称为vermagic.o );该对象包含有关模块构建的内核的大量信息,包括目标内核版本、编译器版本以及许多重要配置变量的设置。当尝试加载模块时,可以测试此信息与正在运行的内核的兼容性。如果不匹配,则不会加载该模块;相反,您会看到类似以下内容:

The kernel does not just assume that a given module has been built against the proper kernel version. One of the steps in the build process is to link your module against a file (called vermagic.o) from the current kernel tree; this object contains a fair amount of information about the kernel the module was built for, including the target kernel version, compiler version, and the settings of a number of important configuration variables. When an attempt is made to load a module, this information can be tested for compatibility with the running kernel. If things don't match, the module is not loaded; instead, you see something like:

# insmod hello.ko
Error inserting './hello.ko': -1 Invalid module format
# insmod hello.ko
Error inserting './hello.ko': -1 Invalid module format

查看系统日志文件(/var/log/messages 或您的系统配置使用的任何文件)将会揭示导致模块加载失败的具体问题。

A look in the system log file (/var/log/messages or whatever your system is configured to use) will reveal the specific problem that caused the module to fail to load.

如果您需要为特定内核版本编译模块,则需要使用该特定版本的构建系统和源代码树。对前面显示的示例 makefile 中的变量进行简单更改 KERNELDIR即可达到目的。

If you need to compile a module for a specific kernel version, you will need to use the build system and source tree for that particular version. A simple change to the KERNELDIR variable in the example makefile shown previously does the trick.

内核接口经常在版本之间发生变化。如果您正在编写一个旨在与多个内核版本一起工作的模块(特别是如果它必须跨主要版本工作),您可能必须使用宏和 #ifdef 构造来正确构建代码。本书这一版只涉及内核的一个主要版本,因此您不会经常在我们的示例代码中看到版本测试。但对它们的需求确实偶尔会出现。在这种情况下,您需要使用 linux/version.h 中的定义。该头文件由 linux/module.h 自动包含,定义了以下宏:

Kernel interfaces often change between releases. If you are writing a module that is intended to work with multiple versions of the kernel (especially if it must work across major releases), you likely have to make use of macros and #ifdef constructs to make your code build properly. This edition of this book only concerns itself with one major version of the kernel, so you do not often see version tests in our example code. But the need for them does occasionally arise. In such cases, you want to make use of the definitions found in linux/version.h. This header file, automatically included by linux/module.h, defines the following macros:

UTS_RELEASE
UTS_RELEASE

该宏扩展为描述该内核树版本的字符串。例如,“ 2.6.10”。

This macro expands to a string describing the version of this kernel tree. For example, "2.6.10".

LINUX_VERSION_CODE
LINUX_VERSION_CODE

该宏扩展为内核版本的二进制表示形式,版本发行号的每个部分一个字节。例如,2.6.10 的代码是 132618(即 0x02060a)。[ 2 ]有了这些信息,您(几乎)可以轻松确定您正在处理的内核版本。

This macro expands to the binary representation of the kernel version, one byte for each part of the version release number. For example, the code for 2.6.10 is 132618 (i.e., 0x02060a).[2] With this information, you can (almost) easily determine what version of the kernel you are dealing with.

KERNEL_VERSION(major,minor,release)
KERNEL_VERSION(major,minor,release)

这是用于根据构建版本号的各个数字构建整数版本代码的宏。例如,KERNEL_VERSION(2,6,10)扩展为 132618。当您需要比较当前版本和已知检查点时,该宏非常有用。

This is the macro used to build an integer version code from the individual numbers that build up a version number. For example, KERNEL_VERSION(2,6,10) expands to 132618. This macro is very useful when you need to compare the current version and a known checkpoint.

大多数基于内核版本的依赖问题都可以通过利用 KERNEL_VERSION 和 LINUX_VERSION_CODE 使用预处理器条件来解决。但是,版本依赖不应让驱动程序代码被繁杂的 #ifdef 条件弄得混乱;处理不兼容性的最佳方法是将它们限制在特定的头文件中。作为一般规则,明确依赖于版本(或平台)的代码应隐藏在低级宏或函数后面。高级代码随后可以只调用这些函数,而无需关心低级细节。以这种方式编写的代码往往更容易阅读,也更健壮。

Most dependencies based on the kernel version can be worked around with preprocessor conditionals by exploiting KERNEL_VERSION and LINUX_VERSION_CODE. Version dependency should, however, not clutter driver code with hairy #ifdef conditionals; the best way to deal with incompatibilities is by confining them to a specific header file. As a general rule, code which is explicitly version (or platform) dependent should be hidden behind a low-level macro or function. High-level code can then just call those functions without concern for the low-level details. Code written in this way tends to be easier to read and more robust.
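例如,可以把版本测试限制在一个私有头文件中,驱动程序主体只调用包装函数(这里的 my_setup_device 和 struct my_dev 都是假设的名字,仅作示意):

For example, the version test can be confined to a private header, with the driver body calling only the wrapper (my_setup_device and struct my_dev are hypothetical names, shown only as a sketch):

```c
/* my_compat.h -- all version tests live here, not in the driver body. */
#include <linux/version.h>

struct my_dev;   /* hypothetical device structure */

#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 0)
static inline void my_setup_device(struct my_dev *dev)
{
    /* variant using the older kernel interface */
}
#else
static inline void my_setup_device(struct my_dev *dev)
{
    /* variant using the 2.6 interface */
}
#endif
```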

平台依赖性

Platform Dependency

每个计算机平台都有其特性,内核设计者可以自由利用所有这些特性,以便在目标目标文件中获得更好的性能。

Each computer platform has its peculiarities, and kernel designers are free to exploit all the peculiarities to achieve better performance in the target object file.

与应用程序开发人员必须将其代码与预编译库链接并遵守参数传递约定不同,内核开发人员可以将一些处理器寄存器专用于特定角色,他们已经这样做了。此外,内核代码可以针对 CPU 系列中的特定处理器进行优化,以充分利用目标平台:与通常以二进制格式分发的应用程序不同,内核的自定义编译可以针对特定计算机集进行优化。

Unlike application developers, who must link their code with precompiled libraries and stick to conventions on parameter passing, kernel developers can dedicate some processor registers to specific roles, and they have done so. Moreover, kernel code can be optimized for a specific processor in a CPU family to get the best from the target platform: unlike applications that are often distributed in binary format, a custom compilation of the kernel can be optimized for a specific computer set.

例如,IA32 (x86) 架构已细分为几种不同的处理器类型。旧的 80386 处理器(目前)仍受支持,尽管按现代标准衡量,其指令集相当有限。该架构中更现代的处理器引入了许多新功能,包括进入内核的更快指令、处理器间锁定、数据复制等。较新的处理器在以正确模式运行时,还可以采用 36 位(或更大)的物理地址,允许它们寻址超过 4 GB 的物理内存。其他处理器系列也有类似的改进。内核可以根据各种配置选项构建,以利用这些附加功能。

For example, the IA32 (x86) architecture has been subdivided into several different processor types. The old 80386 processor is still supported (for now), even though its instruction set is, by modern standards, quite limited. The more modern processors in this architecture have introduced a number of new capabilities, including faster instructions for entering the kernel, interprocessor locking, copying data, etc. Newer processors can also, when operated in the correct mode, employ 36-bit (or larger) physical addresses, allowing them to address more than 4 GB of physical memory. Other processor families have seen similar improvements. The kernel, depending on various configuration options, can be built to make use of these additional features.

显然,如果一个模块要与给定的内核一起工作,则必须以与该内核相同的对目标处理器的理解来构建它。vermagic.o对象再次发挥作用。加载模块时,内核会检查该模块的处理器特定配置选项,并确保它们与正在运行的内核匹配。如果模块是使用不同选项编译的,则不会加载该模块。

Clearly, if a module is to work with a given kernel, it must be built with the same understanding of the target processor as that kernel was. Once again, the vermagic.o object comes in to play. When a module is loaded, the kernel checks the processor-specific configuration options for the module and makes sure they match the running kernel. If the module was compiled with different options, it is not loaded.

如果您打算编写一个供通用发行的驱动程序,您可能想知道如何才能支持所有这些不同的变体。当然,最好的答案是在与 GPL 兼容的许可证下发布您的驱动程序,并将其贡献给主线内核。如果做不到这一点,以源代码形式分发驱动程序,并附上一组在用户系统上编译它的脚本,可能是最好的答案。一些供应商已经发布了简化此任务的工具。如果必须以二进制形式分发驱动程序,则需要查看目标发行版提供的各种内核,并为每个内核提供一个模块版本。请务必考虑自发行版发布以来可能推出的任何勘误内核。此外,还有许可问题需要考虑,正如我们在第 1.6 节中讨论的那样。作为一般规则,以源代码形式分发是更容易立足的方式。

If you are planning to write a driver for general distribution, you may well be wondering just how you can possibly support all these different variations. The best answer, of course, is to release your driver under a GPL-compatible license and contribute it to the mainline kernel. Failing that, distributing your driver in source form and a set of scripts to compile it on the user's system may be the best answer. Some vendors have released tools to make this task easier. If you must distribute your driver in binary form, you need to look at the different kernels provided by your target distributions, and provide a version of the module for each. Be sure to take into account any errata kernels that may have been released since the distribution was produced. Then, there are licensing issues to be considered, as we discussed in Section 1.6. As a general rule, distributing things in source form is an easier way to make your way in the world.

内核符号表

The Kernel Symbol Table

我们已经看到了 insmod 如何根据公共内核符号表解析未定义的符号。该表包含实现模块化驱动程序所需的全局内核项(函数和变量)的地址。当加载模块时,该模块导出的任何符号都会成为内核符号表的一部分。通常情况下,模块实现自己的功能,根本不需要导出任何符号。但是,只要其他模块可能从中受益,您就需要导出符号。

We've seen how insmod resolves undefined symbols against the table of public kernel symbols. The table contains the addresses of global kernel items—functions and variables—that are needed to implement modularized drivers. When a module is loaded, any symbol exported by the module becomes part of the kernel symbol table. In the usual case, a module implements its own functionality without the need to export any symbols at all. You need to export symbols, however, whenever other modules may benefit from using them.

新模块可以使用您的模块导出的符号,您也可以将新模块堆叠在其他模块之上。模块堆叠在主线内核源代码中也有实现:msdos 文件系统依赖于 fat 模块导出的符号,而每个 USB 输入设备模块都堆叠在 usbcore 和 input 模块之上。

New modules can use symbols exported by your module, and you can stack new modules on top of other modules. Module stacking is implemented in the mainstream kernel sources as well: the msdos filesystem relies on symbols exported by the fat module, and each input USB device module stacks on the usbcore and input modules.

模块堆叠在复杂的项目中很有用。如果一个新的抽象以设备驱动程序的形式实现,它可能会为特定于硬件的实现提供一个挂接点。例如,video-for-linux 驱动程序集被分成一个通用模块和针对特定硬件的较低级设备驱动程序,通用模块导出供后者使用的符号。根据您的设置,您需要加载通用视频模块和适用于您所安装硬件的特定模块。对并行端口和各种可连接设备的支持也以相同的方式处理,USB 内核子系统亦然。并行端口子系统中的堆叠如图 2-2 所示;箭头显示了模块之间以及模块与内核编程接口之间的通信。

Module stacking is useful in complex projects. If a new abstraction is implemented in the form of a device driver, it might offer a plug for hardware-specific implementations. For example, the video-for-linux set of drivers is split into a generic module that exports symbols used by lower-level device drivers for specific hardware. According to your setup, you load the generic video module and the specific module for your installed hardware. Support for parallel ports and the wide variety of attachable devices is handled in the same way, as is the USB kernel subsystem. Stacking in the parallel port subsystem is shown in Figure 2-2; the arrows show the communications between the modules and with the kernel programming interface.

并行端口驱动器模块的堆叠

图 2-2。并行端口驱动器模块的堆叠

Figure 2-2. Stacking of parallel port driver modules

当使用堆叠模块时,了解 modprobe 实用程序会很有帮助。正如我们之前所描述的,modprobe 的功能与 insmod 大致相同,但它还会加载您要加载的模块所需的任何其他模块。因此,一个 modprobe 命令有时可以替代对 insmod 的多次调用(尽管从当前目录加载您自己的模块时仍然需要 insmod,因为 modprobe 只在标准的已安装模块目录中查找)。

When using stacked modules, it is helpful to be aware of the modprobe utility. As we described earlier, modprobe functions in much the same way as insmod, but it also loads any other modules that are required by the module you want to load. Thus, one modprobe command can sometimes replace several invocations of insmod (although you'll still need insmod when loading your own modules from the current directory, because modprobe looks only in the standard installed module directories).

使用堆栈将模块分成多个层可以通过简化每一层来帮助减少开发时间。这类似于我们在第一章中讨论的机制和策略的分离。

Using stacking to split modules into multiple layers can help reduce development time by simplifying each layer. This is similar to the separation between mechanism and policy that we discussed in Chapter 1.

Linux 内核头文件提供了一种管理符号可见性的便捷方法,从而减少命名空间污染(使用可能与内核其他地方定义的名称冲突的名称填充命名空间)并促进正确的信息隐藏。如果您的模块需要导出供其他模块使用的符号,应使用以下宏。

The Linux kernel header files provide a convenient way to manage the visibility of your symbols, thus reducing namespace pollution (filling the namespace with names that may conflict with those defined elsewhere in the kernel) and promoting proper information hiding. If your module needs to export symbols for other modules to use, the following macros should be used.

EXPORT_SYMBOL(name);
EXPORT_SYMBOL_GPL(name);
EXPORT_SYMBOL(name);
EXPORT_SYMBOL_GPL(name);

上述任一宏都使给定符号在模块外部可用。该 _GPL版本使该符号仅可用于 GPL 许可的模块。符号必须在模块文件的全局部分(任何函数之外)中导出,因为宏会扩展为可全局访问的特殊用途变量的声明。该变量存储在模块可执行文件的一个特殊部分(“ELF 部分”)中,内核在加载时使用该部分来查找模块导出的变量。(有兴趣的读者可以查看<linux/module.h>了解详细信息,尽管不需要详细信息 让事情顺利进行。)

Either of the above macros makes the given symbol available outside the module. The _GPL version makes the symbol available to GPL-licensed modules only. Symbols must be exported in the global part of the module's file, outside of any function, because the macros expand to the declaration of a special-purpose variable that is expected to be accessible globally. This variable is stored in a special part of the module executable (an "ELF section") that is used by the kernel at load time to find the variables exported by the module. (Interested readers can look at <linux/module.h> for the details, even though the details are not needed to make things work.)
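例如,一个导出符号的模块看起来像这样(my_shared_compute 是假设的函数名):

For example, a module exporting a symbol looks like this (my_shared_compute is a hypothetical function name):

```c
#include <linux/module.h>

/* A helper this module offers to any other module. */
int my_shared_compute(int arg)
{
    return arg * 2;
}
EXPORT_SYMBOL(my_shared_compute);
/* Or, to restrict it to GPL-licensed modules:
 * EXPORT_SYMBOL_GPL(my_shared_compute); */
```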

预备知识

Preliminaries

我们离查看一些实际的模块代码越来越近了。但首先,我们需要看看模块源文件中还需要出现的其他一些内容。内核是一个独特的环境,它对要与之交互的代码提出了自己的要求。

We are getting closer to looking at some actual module code. But first, we need to look at some other things that need to appear in your module source files. The kernel is a unique environment, and it imposes its own requirements on code that would interface with it.

大多数内核代码最终都会包含相当多的头文件来获取函数、数据类型和变量的定义。我们将在接触这些文件时检查它们,但有一些文件是特定于模块的,并且必须出现在每个可加载模块中。因此,几乎所有模块代码都具有以下内容:

Most kernel code ends up including a fairly large number of header files to get definitions of functions, data types, and variables. We'll examine these files as we come to them, but there are a few that are specific to modules, and must appear in every loadable module. Thus, just about all module code has the following:

#include <linux/module.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/init.h>

module.h包含可加载模块所需的符号和函数的大量定义。您需要init.h来指定初始化和清理函数,正如我们在上面的“hello world”示例中看到的那样,我们将在下一节中重新讨论它。大多数模块还包含moduleparam.h以允许在加载时向模块传递参数;我们很快就会讨论这个问题。

module.h contains a great many definitions of symbols and functions needed by loadable modules. You need init.h to specify your initialization and cleanup functions, as we saw in the "hello world" example above, and which we revisit in the next section. Most modules also include moduleparam.h to enable the passing of parameters to the module at load time; we will get to that shortly.

这并不是绝对必要的,但您的模块确实应该指定哪个许可证适用于其代码。这样做只需包含一行 MODULE_LICENSE:

It is not strictly necessary, but your module really should specify which license applies to its code. Doing so is just a matter of including a MODULE_LICENSE line:

MODULE_LICENSE("GPL");
MODULE_LICENSE("GPL");

内核识别的具体许可证有“GPL”(适用于任何版本的 GNU 通用公共许可证)、“GPL v2”(仅适用于 GPL 版本二)、“GPL 和附加权利”、“双 BSD/GPL”、“双 MPL/GPL”和“专有”。除非您的模块被明确标记为处于内核识别的免费许可证之下,否则它被认为是专有的,并且在加载模块时内核会被“污染”。正如我们在1.6 节中提到的,内核开发人员往往不热衷于帮助在加载专有模块后遇到问题的用户。

The specific licenses recognized by the kernel are "GPL" (for any version of the GNU General Public License), "GPL v2" (for GPL version two only), "GPL and additional rights," "Dual BSD/GPL," "Dual MPL/GPL," and "Proprietary." Unless your module is explicitly marked as being under a free license recognized by the kernel, it is assumed to be proprietary, and the kernel is "tainted" when the module is loaded. As we mentioned in Section 1.6, kernel developers tend to be unenthusiastic about helping users who experience problems after loading proprietary modules.

模块中可以包含的其他描述性定义包括 MODULE_AUTHOR(说明模块的编写者)、MODULE_DESCRIPTION(模块功能的人类可读说明)、MODULE_VERSION(代码修订号;有关创建版本字符串时使用的约定,请参阅 <linux/module.h> 中的注释)、MODULE_ALIAS(该模块的别名)以及 MODULE_DEVICE_TABLE(告诉用户空间该模块支持哪些设备)。

Other descriptive definitions that can be contained within a module include MODULE_AUTHOR (stating who wrote the module), MODULE_DESCRIPTION (a human-readable statement of what the module does), MODULE_VERSION (for a code revision number; see the comments in <linux/module.h> for the conventions to use in creating version strings), MODULE_ALIAS (another name by which this module can be known), and MODULE_DEVICE_TABLE (to tell user space about which devices the module supports).
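一组典型的描述性声明可能如下(作者与描述字符串均为虚构示例):

A typical set of descriptive declarations might look like the following (the author and description strings are made-up examples):

```c
MODULE_AUTHOR("Jane Hacker");
MODULE_DESCRIPTION("A sample driver used to illustrate module declarations");
MODULE_VERSION("1.0");
MODULE_ALIAS("sample_alias");
MODULE_LICENSE("Dual BSD/GPL");
```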

各种MODULE_声明可以出现在源文件中函数之外的任何位置。然而,内核代码中相对较新的约定是将这些声明放在文件末尾。

The various MODULE_ declarations can appear anywhere within your source file outside of a function. A relatively recent convention in kernel code, however, is to put these declarations at the end of the file.

初始化和关闭

Initialization and Shutdown

正如已经提到的,模块初始化函数注册模块提供的任何设施。所谓设施,我们指的是可以由应用程序访问的新功能,无论是整个驱动程序还是新的软件抽象。初始化函数的实际定义总是如下所示:

As already mentioned, the module initialization function registers any facility offered by the module. By facility, we mean a new functionality, be it a whole driver or a new software abstraction, that can be accessed by an application. The actual definition of the initialization function always looks like:

static int _ _init initialization_function(void)
{
    /* 初始化代码在这里 */
}
module_init(initialization_function);
static int _ _init initialization_function(void)
{
    /* Initialization code here */
}
module_init(initialization_function);

初始化函数应该声明为 static,因为它们在特定文件之外不应可见;不过,对此没有硬性规定,因为除非明确请求,否则任何函数都不会导出到内核的其余部分。定义中的 _ _init 标记可能看起来有点奇怪;它向内核暗示给定的函数仅在初始化时使用。模块加载器在模块加载完成后丢弃初始化函数,使其内存可供其他用途。对于仅在初始化期间使用的数据,有一个类似的标记 (_ _initdata)。使用 _ _init 和 _ _initdata 是可选的,但值得费这番功夫。只是要确保不要将它们用于初始化完成后还会使用的任何函数(或数据结构)。您可能还会在内核源代码中遇到 _ _devinit 和 _ _devinitdata;仅当内核未配置为支持可热插拔设备时,它们才会转换为 _ _init 和 _ _initdata。我们将在第 14 章讨论热插拔支持。

Initialization functions should be declared static, since they are not meant to be visible outside the specific file; there is no hard rule about this, though, as no function is exported to the rest of the kernel unless explicitly requested. The _ _init token in the definition may look a little strange; it is a hint to the kernel that the given function is used only at initialization time. The module loader drops the initialization function after the module is loaded, making its memory available for other uses. There is a similar tag (_ _initdata) for data used only during initialization. Use of _ _init and _ _initdata is optional, but it is worth the trouble. Just be sure not to use them for any function (or data structure) you will be using after initialization completes. You may also encounter _ _devinit and _ _devinitdata in the kernel source; these translate to _ _init and _ _initdata only if the kernel has not been configured for hotpluggable devices. We will look at hotplug support in Chapter 14.

module_init 的使用是强制性的。该宏向模块的目标代码添加一个特殊的段,说明在哪里可以找到模块的初始化函数。如果没有这个定义,您的初始化函数将永远不会被调用。

The use of module_init is mandatory. This macro adds a special section to the module's object code stating where the module's initialization function is to be found. Without this definition, your initialization function is never called.

模块可以注册许多不同类型的设施,包括不同类型的设备、文件系统、加密转换等等。对于每个设施,都有一个特定的内核函数来完成此注册。传递给内核注册函数的参数通常是指向描述新设施和正在注册的设施名称的数据结构的指针。数据结构通常包含指向模块函数的指针,这就是调用模块主体中的函数的方式。

Modules can register many different types of facilities, including different kinds of devices, filesystems, cryptographic transforms, and more. For each facility, there is a specific kernel function that accomplishes this registration. The arguments passed to the kernel registration functions are usually pointers to data structures describing the new facility and the name of the facility being registered. The data structure usually contains pointers to module functions, which is how functions in the module body get called.
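作为示意(字符设备注册本身将在第 3 章详细介绍;这里的 my_open、my_read 和 MY_MAJOR 均为假设的名字),该模式大致如下:填充一个函数指针结构,然后把它传给某个 register_* 函数:

As a sketch (char-device registration itself is covered in detail in Chapter 3; my_open, my_read, and MY_MAJOR are hypothetical names), the pattern looks roughly like this: fill in a structure of function pointers, then hand it to a register_* function:

```c
#include <linux/fs.h>
#include <linux/module.h>

static struct file_operations my_fops = {
    .owner = THIS_MODULE,
    .open  = my_open,     /* called when the device is opened */
    .read  = my_read,     /* called for read() on the device */
};

static int __init my_init(void)
{
    /* Register the facility; a negative return means failure. */
    int result = register_chrdev(MY_MAJOR, "my_device", &my_fops);
    if (result < 0)
        return result;
    return 0;
}
module_init(my_init);
```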

The items that can be registered go beyond the list of device types mentioned in Chapter 1. They include, among others, serial ports, miscellaneous devices, sysfs entries, /proc files, executable domains, and line disciplines. Many of those registrable items support functions that aren't directly related to hardware but remain in the "software abstractions" field. Those items can be registered, because they are integrated into the driver's functionality anyway (like /proc files and line disciplines for example).

There are other facilities that can be registered as add-ons for certain drivers, but their use is so specific that it's not worth talking about them; they use the stacking technique, as described in Section 2.5. If you want to probe further, you can grep for EXPORT_SYMBOL in the kernel sources, and find the entry points offered by different drivers. Most registration functions are prefixed with register_, so another possible way to find them is to grep for register_ in the kernel source.

The Cleanup Function

Every nontrivial module also requires a cleanup function, which unregisters interfaces and returns all resources to the system before the module is removed. This function is defined as:

static void _ _exit cleanup_function(void)
{
    /* Cleanup code here */
}

module_exit(cleanup_function);

The cleanup function has no value to return, so it is declared void. The _ _exit modifier marks the code as being for module unload only (by causing the compiler to place it in a special ELF section). If your module is built directly into the kernel, or if your kernel is configured to disallow the unloading of modules, functions marked _ _exit are simply discarded. For this reason, a function marked _ _exit can be called only at module unload or system shutdown time; any other use is an error. Once again, the module_exit declaration is necessary to enable the kernel to find your cleanup function.

If your module does not define a cleanup function, the kernel does not allow it to be unloaded.

Error Handling During Initialization

One thing you must always bear in mind when registering facilities with the kernel is that the registration could fail. Even the simplest action often requires memory allocation, and the required memory may not be available. So module code must always check return values, and be sure that the requested operations have actually succeeded.

If any errors occur when you register utilities, the first order of business is to decide whether the module can continue initializing itself anyway. Often, the module can continue to operate after a registration failure, with degraded functionality if necessary. Whenever possible, your module should press forward and provide what capabilities it can after things fail.

If it turns out that your module simply cannot load after a particular type of failure, you must undo any registration activities performed before the failure. Linux doesn't keep a per-module registry of facilities that have been registered, so the module must back out of everything itself if initialization fails at some point. If you ever fail to unregister what you obtained, the kernel is left in an unstable state; it contains internal pointers to code that no longer exists. In such situations, the only recourse, usually, is to reboot the system. You really do want to take care to do the right thing when an initialization error occurs.

Error recovery is sometimes best handled with the goto statement. We normally hate to use goto, but in our opinion, this is one situation where it is useful. Careful use of goto in error situations can eliminate a great deal of complicated, highly-indented, "structured" logic. Thus, in the kernel, goto is often used as shown here to deal with errors.

The following sample code (using fictitious registration and unregistration functions) behaves correctly if initialization fails at any point:

int _ _init my_init_function(void)
{
    int err;

    /* registration takes a pointer and a name */
    err = register_this(ptr1, "skull");
    if (err) goto fail_this;
    err = register_that(ptr2, "skull");
    if (err) goto fail_that;
    err = register_those(ptr3, "skull");
    if (err) goto fail_those;

    return 0; /* success */

  fail_those: unregister_that(ptr2, "skull");
  fail_that: unregister_this(ptr1, "skull");
  fail_this: return err; /* propagate the error */
 }

This code attempts to register three (fictitious) facilities. The goto statement is used in case of failure to cause the unregistration of only the facilities that had been successfully registered before things went bad.

Another option, requiring no hairy goto statements, is keeping track of what has been successfully registered and calling your module's cleanup function in case of any error. The cleanup function unrolls only the steps that have been successfully accomplished. This alternative, however, requires more code and more CPU time, so in fast paths you still resort to goto as the best error-recovery tool.

The return value of my_init_function, err, is an error code. In the Linux kernel, error codes are negative numbers belonging to the set defined in <linux/errno.h>. If you want to generate your own error codes instead of returning what you get from other functions, you should include <linux/errno.h> in order to use symbolic values such as -ENODEV, -ENOMEM, and so on. It is always good practice to return appropriate error codes, because user programs can turn them to meaningful strings using perror or similar means.

Obviously, the module cleanup function must undo any registration performed by the initialization function, and it is customary (but not usually mandatory) to unregister facilities in the reverse order used to register them:

void _ _exit my_cleanup_function(void)
{
    unregister_those(ptr3, "skull");
    unregister_that(ptr2, "skull");
    unregister_this(ptr1, "skull");
    return;
}

If your initialization and cleanup are more complex than dealing with a few items, the goto approach may become difficult to manage, because all the cleanup code must be repeated within the initialization function, with several labels intermixed. Sometimes, therefore, a different layout of the code proves more successful.

What you'd do to minimize code duplication and keep everything streamlined is to call the cleanup function from within the initialization whenever an error occurs. The cleanup function then must check the status of each item before undoing its registration. In its simplest form, the code looks like the following:

struct something *item1;
struct somethingelse *item2;
int stuff_ok;

void my_cleanup(void)
{
    if (item1)
        release_thing(item1);
    if (item2)
        release_thing2(item2);
    if (stuff_ok)
        unregister_stuff(  );
    return;
 }

int _ _init my_init(void)
{
    int err = -ENOMEM;

    item1 = allocate_thing(arguments);
    item2 = allocate_thing2(arguments2);
    if (!item1 || !item2)
        goto fail;
    err = register_stuff(item1, item2);
    if (!err)
        stuff_ok = 1;
    else
        goto fail;
    return 0; /* success */ 
   
  fail:
    my_cleanup(  );
    return err;
}

As shown in this code, you may or may not need external flags to mark success of the initialization step, depending on the semantics of the registration/allocation function you call. Whether or not flags are needed, this kind of initialization scales well to a large number of items and is often better than the technique shown earlier. Note, however, that the cleanup function cannot be marked _ _exit when it is called by nonexit code, as in the previous example.

Module-Loading Races

Thus far, our discussion has skated over an important aspect of module loading: race conditions. If you are not careful in how you write your initialization function, you can create situations that can compromise the stability of the system as a whole. We will discuss race conditions later in this book; for now, a couple of quick points will have to suffice.

The first is that you should always remember that some other part of the kernel can make use of any facility you register immediately after that registration has completed. It is entirely possible, in other words, that the kernel will make calls into your module while your initialization function is still running. So your code must be prepared to be called as soon as it completes its first registration. Do not register any facility until all of your internal initialization needed to support that facility has been completed.

You must also consider what happens if your initialization function decides to fail, but some part of the kernel is already making use of a facility your module has registered. If this situation is possible for your module, you should seriously consider not failing the initialization at all. After all, the module has clearly succeeded in exporting something useful. If initialization must fail, it must carefully step around any possible operations going on elsewhere in the kernel until those operations have completed.

Module Parameters

Several parameters that a driver needs to know can change from system to system. These can vary from the device number to use (as we'll see in the next chapter) to numerous aspects of how the driver should operate. For example, drivers for SCSI adapters often have options controlling the use of tagged command queuing, and the Integrated Device Electronics (IDE) drivers allow user control of DMA operations. If your driver controls older hardware, it may also need to be told explicitly where to find that hardware's I/O ports or I/O memory addresses. The kernel supports these needs by making it possible for a driver to designate parameters that may be changed when the driver's module is loaded.

These parameter values can be assigned at load time by insmod or modprobe; the latter can also read parameter assignment from its configuration file (/etc/modprobe.conf). The commands accept the specification of several types of values on the command line. As a way of demonstrating this capability, imagine a much-needed enhancement to the "hello world" module (called hellop) shown at the beginning of this chapter. We add two parameters: an integer value called howmany and a character string called whom. Our vastly more functional module then, at load time, greets whom not just once, but howmany times. Such a module could then be loaded with a command line such as:

insmod hellop howmany=10 whom="Mom"

Upon being loaded that way, hellop would say "Hello, Mom" 10 times.

However, before insmod can change module parameters, the module must make them available. Parameters are declared with the module_param macro, which is defined in moduleparam.h. module_param takes three parameters: the name of the variable, its type, and a permissions mask to be used for an accompanying sysfs entry. The macro should be placed outside of any function and is typically found near the head of the source file. So hellop would declare its parameters and make them available to insmod as follows:

static char *whom = "world";
static int howmany = 1;
module_param(howmany, int, S_IRUGO);
module_param(whom, charp, S_IRUGO);

Numerous types are supported for module parameters:

bool

invbool

A boolean (true or false) value (the associated variable should be of type int). The invbool type inverts the value, so that true values become false and vice versa.

charp

A char pointer value. Memory is allocated for user-provided strings, and the pointer is set accordingly.

int

long

short

uint

ulong

ushort

Basic integer values of various lengths. The versions starting with u are for unsigned values.

Array parameters, where the values are supplied as a comma-separated list, are also supported by the module loader. To declare an array parameter, use:

module_param_array(name,type,nump,perm);

Where name is the name of your array (and of the parameter), type is the type of the array elements, nump is an integer variable, and perm is the usual permissions value. If the array parameter is set at load time, nump is set to the number of values supplied. The module loader refuses to accept more values than will fit in the array.

If you really need a type that does not appear in the list above, there are hooks in the module code that allow you to define them; see moduleparam.h for details on how to do that. All module parameters should be given a default value; insmod changes the value only if explicitly told to by the user. The module can check for explicit parameters by testing parameters against their default values.

The final module_param field is a permission value; you should use the definitions found in <linux/stat.h>. This value controls who can access the representation of the module parameter in sysfs. If perm is set to 0, there is no sysfs entry at all; otherwise, it appears under /sys/module [3] with the given set of permissions. Use S_IRUGO for a parameter that can be read by the world but cannot be changed; S_IRUGO|S_IWUSR allows root to change the parameter. Note that if a parameter is changed by sysfs, the value of that parameter as seen by your module changes, but your module is not notified in any other way. You should probably not make module parameters writable, unless you are prepared to detect the change and react accordingly.

Doing It in User Space

A Unix programmer who's addressing kernel issues for the first time might be nervous about writing a module. Writing a user program that reads and writes directly to the device ports may be easier.

Indeed, there are some arguments in favor of user-space programming, and sometimes writing a so-called user-space device driver is a wise alternative to kernel hacking. In this section, we discuss some of the reasons why you might write a driver in user space. This book is about kernel-space drivers, however, so we do not go beyond this introductory discussion.

The advantages of user-space drivers are:

  • The full C library can be linked in. The driver can perform many exotic tasks without resorting to external programs (the utility programs implementing usage policies that are usually distributed along with the driver itself).

  • The programmer can run a conventional debugger on the driver code without having to go through contortions to debug a running kernel.

  • If a user-space driver hangs, you can simply kill it. Problems with the driver are unlikely to hang the entire system, unless the hardware being controlled is really misbehaving.

  • User memory is swappable, unlike kernel memory. An infrequently used device with a huge driver won't occupy RAM that other programs could be using, except when it is actually in use.

  • A well-designed driver program can still, like kernel-space drivers, allow concurrent access to a device.

  • If you must write a closed-source driver, the user-space option makes it easier for you to avoid ambiguous licensing situations and problems with changing kernel interfaces.

For example, USB drivers can be written for user space; see the (still young) libusb project at libusb.sourceforge.net and "gadgetfs" in the kernel source. Another example is the X server: it knows exactly what the hardware can do and what it can't, and it offers the graphic resources to all X clients. Note, however, that there is a slow but steady drift toward frame-buffer-based graphics environments, where the X server acts only as a server based on a real kernel-space device driver for actual graphic manipulation.

Usually, the writer of a user-space driver implements a server process, taking over from the kernel the task of being the single agent in charge of hardware control. Client applications can then connect to the server to perform actual communication with the device; therefore, a smart driver process can allow concurrent access to the device. This is exactly how the X server works.

But the user-space approach to device driving has a number of drawbacks. The most important are:

  • Interrupts are not available in user space. There are workarounds for this limitation on some platforms, such as the vm86 system call on the IA32 architecture.

  • Direct access to memory is possible only by mmapping /dev/mem, and only a privileged user can do that.

  • Access to I/O ports is available only after calling ioperm or iopl. Moreover, not all platforms support these system calls, and access to /dev/port can be too slow to be effective. Both the system calls and the device file are reserved to a privileged user.

  • Response time is slower, because a context switch is required to transfer information or actions between the client and the hardware.

  • Worse yet, if the driver has been swapped to disk, response time is unacceptably long. Using the mlock system call might help, but usually you'll need to lock many memory pages, because a user-space program depends on a lot of library code. mlock, too, is limited to privileged users.

  • The most important devices can't be handled in user space, including, but not limited to, network interfaces and block devices.

As you see, user-space drivers can't do that much after all. Interesting applications nonetheless exist: for example, support for SCSI scanner devices (implemented by the SANE package) and CD writers (implemented by cdrecord and other tools). In both cases, user-level device drivers rely on the "SCSI generic" kernel driver, which exports low-level SCSI functionality to user-space programs so they can drive their own hardware.

One case in which working in user space might make sense is when you are beginning to deal with new and unusual hardware. This way you can learn to manage your hardware without the risk of hanging the whole system. Once you've done that, encapsulating the software in a kernel module should be a painless operation.

Quick Reference

This section summarizes the kernel functions, variables, macros, and /proc files that we've touched on in this chapter. It is meant to act as a reference. Each item is listed after the relevant header file, if any. A similar section appears at the end of almost every chapter from here on, summarizing the new symbols introduced in the chapter. Entries in this section generally appear in the same order in which they were introduced in the chapter:

insmod

modprobe

rmmod

User-space utilities that load modules into the running kernels and remove them.

#include <linux/init.h>

module_init(init_function);

module_exit(cleanup_function);

Macros that designate a module's initialization and cleanup functions.

_ _init

_ _initdata

_ _exit

_ _exitdata

Markers for functions (_ _init and _ _exit) and data (_ _initdata and _ _exitdata) that are only used at module initialization or cleanup time. Items marked for initialization may be discarded once initialization completes; the exit items may be discarded if module unloading has not been configured into the kernel. These markers work by causing the relevant objects to be placed in a special ELF section in the executable file.

#include <linux/sched.h>

One of the most important header files. This file contains definitions of much of the kernel API used by the driver, including functions for sleeping and numerous variable declarations.

struct task_struct *current;

The current process.

current->pid

current->comm

The process ID and command name for the current process.

obj-m

A makefile symbol used by the kernel build system to determine which modules should be built in the current directory.

/sys/module

/proc/modules

/sys/module is a sysfs directory hierarchy containing information on currently-loaded modules. /proc/modules is the older, single-file version of that information. Entries contain the module name, the amount of memory each module occupies, and the usage count. Extra strings are appended to each line to specify flags that are currently active for the module.

vermagic.o

An object file from the kernel source directory that describes the environment a module was built for.

#include <linux/module.h>

Required header. It must be included by a module source.

#include <linux/version.h>

A header file containing information on the version of the kernel being built.

LINUX_VERSION_CODE

Integer macro, useful to #ifdef version dependencies.

EXPORT_SYMBOL (symbol);

EXPORT_SYMBOL_GPL (symbol);

Macro used to export a symbol to the kernel. The second form limits use of the exported symbol to GPL-licensed modules.

MODULE_AUTHOR(author);

MODULE_DESCRIPTION(description);

MODULE_VERSION(version_string);

MODULE_DEVICE_TABLE(table_info);

MODULE_ALIAS(alternate_name);

Place documentation on the module in the object file.

MODULE_LICENSE(license);

Declare the license governing this module.

#include <linux/moduleparam.h>

module_param(variable, type, perm);

Macro that creates a module parameter that can be adjusted by the user when the module is loaded (or at boot time for built-in code). The type can be one of bool, charp, int, invbool, long, short, ushort, uint, ulong, or intarray.

#include <linux/kernel.h>

int printk(const char * fmt, ...);

The analogue of printf for kernel code.




[1] The priority is just a string, such as <1>, which is prepended to the printk format string. Note the lack of a comma after KERN_ALERT; adding a comma there is a common and annoying typo (which, fortunately, is caught by the compiler).

[2] This allows up to 256 development versions between stable versions.

[3] As of this writing, there is talk of moving parameters elsewhere within sysfs, however.

Chapter 3. Char Drivers

The goal of this chapter is to write a complete char device driver. We develop a character driver because this class is suitable for most simple hardware devices. Char drivers are also easier to understand than block drivers or network drivers (which we get to in later chapters). Our ultimate aim is to write a modularized char driver, but we won't talk about modularization issues in this chapter.

Throughout the chapter, we present code fragments extracted from a real device driver: scull (Simple Character Utility for Loading Localities). scull is a char driver that acts on a memory area as though it were a device. In this chapter, because of that peculiarity of scull, we use the word device interchangeably with "the memory area used by scull."

The advantage of scull is that it isn't hardware dependent. scull just acts on some memory, allocated from the kernel. Anyone can compile and run scull, and scull is portable across the computer architectures on which Linux runs. On the other hand, the device doesn't do anything "useful" other than demonstrate the interface between the kernel and char drivers and allow the user to run some tests.

The Design of scull

The first step of driver writing is defining the capabilities (the mechanism) the driver will offer to user programs. Since our "device" is part of the computer's memory, we're free to do what we want with it. It can be a sequential or random-access device, one device or many, and so on.

为了使 scull 能够作为为真实设备编写真实驱动程序的有用模板,我们将向您展示如何在计算机内存之上实现多个设备抽象,每个抽象都有不同的个性。

To make scull useful as a template for writing real drivers for real devices, we'll show you how to implement several device abstractions on top of the computer memory, each with a different personality.

scull 的源代码实现了以下设备。模块实现的每一种设备被称为一种类型(type)。

The scull source implements the following devices. Each kind of device implemented by the module is referred to as a type .

scull0scull3
scull0 to scull3

四个设备,每个设备都由一块既全局又持久的内存区域构成。全局意味着如果设备被多次打开,设备中包含的数据将由所有打开它的文件描述符共享。持久意味着设备被关闭后再重新打开,数据不会丢失。该设备用起来很有意思,因为可以使用常规命令(例如 cp、cat 和 shell 的 I/O 重定向)来访问和测试它。

Four devices, each consisting of a memory area that is both global and persistent. Global means that if the device is opened multiple times, the data contained within the device is shared by all the file descriptors that opened it. Persistent means that if the device is closed and reopened, data isn't lost. This device can be fun to work with, because it can be accessed and tested using conventional commands, such as cp, cat, and shell I/O redirection.

scullpipe0scullpipe3
scullpipe0 to scullpipe3

四个 FIFO(先进先出)设备,其作用类似于管道:一个进程读取另一个进程写入的内容。如果多个进程读取同一设备,它们就会争夺数据。scullpipe 的内部结构将展示如何在不诉诸中断的情况下实现阻塞和非阻塞的读写。尽管真正的驱动程序使用硬件中断与其设备同步,但阻塞和非阻塞操作这一主题十分重要,并且与中断处理(在第 10 章中介绍)是相互独立的。

Four FIFO (first-in-first-out) devices, which act like pipes. One process reads what another process writes. If multiple processes read the same device, they contend for data. The internals of scullpipe will show how blocking and nonblocking read and write can be implemented without having to resort to interrupts. Although real drivers synchronize with their devices using hardware interrupts, the topic of blocking and nonblocking operations is an important one and is separate from interrupt handling (covered in Chapter 10).

scullsingle

scullpriv

sculluid

scullwuid
scullsingle

scullpriv

sculluid

scullwuid

这些设备与 scull0 类似,但对何时允许打开有一些限制。第一个(scullsingle)一次只允许一个进程使用该驱动程序,而 scullpriv 对每个虚拟控制台(或 X 终端会话)是私有的,因为每个控制台/终端上的进程获得不同的内存区域。sculluid 和 scullwuid 可以被多次打开,但一次只能由一个用户打开;如果另一个用户正锁定着设备,前者会返回"Device Busy"(设备忙)错误,而后者则实现阻塞式打开。scull 的这些变体看起来似乎混淆了策略与机制,但它们值得研究,因为一些现实中的设备确实需要这类管理。

These devices are similar to scull0 but with some limitations on when an open is permitted. The first (scullsingle) allows only one process at a time to use the driver, whereas scullpriv is private to each virtual console (or X terminal session), because processes on each console/terminal get different memory areas. sculluid and scullwuid can be opened multiple times, but only by one user at a time; the former returns an error of "Device Busy" if another user is locking the device, whereas the latter implements blocking open. These variations of scull would appear to be confusing policy and mechanism, but they are worth looking at, because some real-life devices require this sort of management.

每个 scull 设备都展示了驱动程序的不同特性,并带来不同的难点。本章涵盖 scull0 至 scull3 的内部结构;更高级的设备在第 6 章中介绍:scullpipe 在 3.4 节中描述,其余设备在 6.6 节中描述。

Each of the scull devices demonstrates different features of a driver and presents different difficulties. This chapter covers the internals of scull0 to scull3; the more advanced devices are covered in Chapter 6: scullpipe is described in Section 3.4, and the others are described in Section 6.6.

主设备号和次设备号

Major and Minor Numbers

字符设备是通过文件系统中的名称来访问的。这些名称被称为特殊文件、设备文件,或简称为文件系统树的节点;它们通常位于 /dev 目录中。字符驱动程序的特殊文件可通过 ls -l 输出第一列中的"c"来识别。块设备同样出现在 /dev 中,但它们以"b"标识。本章的重点是字符设备,但以下大部分信息同样适用于块设备。

Char devices are accessed through names in the filesystem. Those names are called special files or device files or simply nodes of the filesystem tree; they are conventionally located in the /dev directory. Special files for char drivers are identified by a "c" in the first column of the output of ls -l. Block devices appear in /dev as well, but they are identified by a "b." The focus of this chapter is on char devices, but much of the following information applies to block devices as well.

如果执行 ls -l 命令,您会在设备文件条目中、最后修改日期之前看到两个以逗号分隔的数字(通常显示文件长度的位置)。这两个数字就是该设备的主设备号和次设备号。以下列表显示了典型系统上出现的一些设备。它们的主设备号是 1、4、7 和 10,次设备号是 1、3、5、64、65 和 129。

If you issue the ls -l command, you'll see two numbers (separated by a comma) in the device file entries before the date of the last modification, where the file length normally appears. These numbers are the major and minor device number for the particular device. The following listing shows a few devices as they appear on a typical system. Their major numbers are 1, 4, 7, and 10, while the minors are 1, 3, 5, 64, 65, and 129.

 crw-rw-rw-    1 root     root       1,   3 Apr 11  2002 null
 crw-------    1 root     root      10,   1 Apr 11  2002 psaux
 crw-------    1 root     root       4,   1 Oct 28 03:04 tty1
 crw-rw-rw-    1 root     tty        4,  64 Apr 11  2002 ttys0
 crw-rw----    1 root     uucp       4,  65 Apr 11  2002 ttyS1
 crw--w----    1 vcsa     tty        7,   1 Apr 11  2002 vcs1
 crw--w----    1 vcsa     tty        7, 129 Apr 11  2002 vcsa1
 crw-rw-rw-    1 root     root       1,   5 Apr 11  2002 zero

传统上,主设备号标识与设备关联的驱动程序。例如,/dev/null 和 /dev/zero 均由 1 号驱动程序管理,而虚拟控制台和串行终端由 4 号驱动程序管理;同样,vcs1 和 vcsa1 设备均由 7 号驱动程序管理。现代 Linux 内核允许多个驱动程序共享同一主设备号,但您将看到的大多数设备仍然按照"一个主设备号对应一个驱动程序"的原则进行组织。

Traditionally, the major number identifies the driver associated with the device. For example, /dev/null and /dev/zero are both managed by driver 1, whereas virtual consoles and serial terminals are managed by driver 4; similarly, both vcs1 and vcsa1 devices are managed by driver 7. Modern Linux kernels allow multiple drivers to share major numbers, but most devices that you will see are still organized on the one-major-one-driver principle.

内核使用次设备号来准确确定所引用的是哪个设备。根据驱动程序的编写方式(如下文所示),您既可以从内核获取指向设备的直接指针,也可以自己把次设备号用作本地设备数组的索引。无论哪种方式,内核本身对次设备号几乎一无所知,只知道它们指向由您的驱动程序实现的设备。

The minor number is used by the kernel to determine exactly which device is being referred to. Depending on how your driver is written (as we will see below), you can either get a direct pointer to your device from the kernel, or you can use the minor number yourself as an index into a local array of devices. Either way, the kernel itself knows almost nothing about minor numbers beyond the fact that they refer to devices implemented by your driver.

设备编号的内部表示

The Internal Representation of Device Numbers

在内核中,dev_t 类型(在 <linux/types.h> 中定义)用于保存设备号,包括主设备号和次设备号两个部分。从内核 2.6.0 版本开始,dev_t 是一个 32 位的量,其中 12 位留给主设备号,20 位留给次设备号。当然,您的代码不应对设备号的内部组织做出任何假设;相反,它应该使用 <linux/kdev_t.h> 中提供的一组宏。要获取一个 dev_t 的主设备号或次设备号部分,请使用:

Within the kernel, the dev_t type (defined in <linux/types.h>) is used to hold device numbers—both the major and minor parts. As of Version 2.6.0 of the kernel, dev_t is a 32-bit quantity with 12 bits set aside for the major number and 20 for the minor number. Your code should, of course, never make any assumptions about the internal organization of device numbers; it should, instead, make use of a set of macros found in <linux/kdev_t.h>. To obtain the major or minor parts of a dev_t, use:

MAJOR(dev_t dev);
MINOR(dev_t dev);

相反,如果您已有主设备号和次设备号,需要把它们合成为一个 dev_t,请使用:

If, instead, you have the major and minor numbers and need to turn them into a dev_t, use:

MKDEV(int major, int minor);

请注意,2.6 内核可以容纳数量庞大的设备,而以前的内核版本仅限于 255 个主设备号和 255 个次设备号。可以假定这个更宽的范围在相当长的时间内都够用,但计算领域充斥着这类错误的假设。因此您应该预料到 dev_t 的格式将来可能再次改变;不过,只要您谨慎地编写驱动程序,这些变化就不会成为问题。

Note that the 2.6 kernel can accommodate a vast number of devices, while previous kernel versions were limited to 255 major and 255 minor numbers. One assumes that the wider range will be sufficient for quite some time, but the computing field is littered with erroneous assumptions of that nature. So you should expect that the format of dev_t could change again in the future; if you write your drivers carefully, however, these changes will not be a problem.

分配和释放设备编号

Allocating and Freeing Device Numbers

在设置字符设备时,驱动程序首先要做的事情之一,就是获取一个或多个可供使用的设备号。完成此任务所需的函数是 register_chrdev_region,它在 <linux/fs.h> 中声明:

One of the first things your driver will need to do when setting up a char device is to obtain one or more device numbers to work with. The necessary function for this task is register_chrdev_region, which is declared in <linux/fs.h>:

int register_chrdev_region(dev_t first, unsigned int count, 
                           char *name);

这里,first 是您想要分配的设备号范围的起始值。first 的次设备号部分通常是 0,但并没有这样的硬性要求。count 是您请求的连续设备号的总数。请注意,如果 count 很大,您请求的范围可能会溢出到下一个主设备号;但只要您请求的号码范围可用,一切仍能正常工作。最后,name 是应与该号码范围关联的设备的名称;它将出现在 /proc/devices 和 sysfs 中。

Here, first is the beginning device number of the range you would like to allocate. The minor number portion of first is often 0, but there is no requirement to that effect. count is the total number of contiguous device numbers you are requesting. Note that, if count is large, the range you request could spill over to the next major number; but everything will still work properly as long as the number range you request is available. Finally, name is the name of the device that should be associated with this number range; it will appear in /proc/devices and sysfs.

与大多数内核函数一样,如果分配成功,register_chrdev_region 的返回值将为 0。如果发生错误,将返回负的错误码,并且您将无法访问所请求的区域。

As with most kernel functions, the return value from register_chrdev_region will be 0 if the allocation was successfully performed. In case of error, a negative error code will be returned, and you will not have access to the requested region.

如果您提前准确知道所需的设备号,register_chrdev_region 工作得很好。然而,您通常并不知道您的设备将使用哪个主设备号;Linux 内核开发社区一直在努力转向使用动态分配的设备号。内核很乐意即时为您分配一个主设备号,但您必须使用另一个函数来请求这种分配:

register_chrdev_region works well if you know ahead of time exactly which device numbers you want. Often, however, you will not know which major numbers your device will use; there is a constant effort within the Linux kernel development community to move over to the use of dynamically allocated device numbers. The kernel will happily allocate a major number for you on the fly, but you must request this allocation by using a different function:

int alloc_chrdev_region(dev_t *dev, unsigned int firstminor, 
                        unsigned int count, char *name);

对于此函数,dev 是一个仅用于输出的参数,成功完成后,它将保存您所分配范围中的第一个设备号。firstminor 应是请求使用的第一个次设备号;它通常是 0。count 和 name 参数的工作方式与传给 register_chrdev_region 的同名参数相同。

With this function, dev is an output-only parameter that will, on successful completion, hold the first number in your allocated range. firstminor should be the requested first minor number to use; it is usually 0. The count and name parameters work like those given to register_chrdev_region.

无论您如何分配设备编号,都应该在不再使用它们时将其释放。设备编号通过以下方式释放:

Regardless of how you allocate your device numbers, you should free them when they are no longer in use. Device numbers are freed with:

void unregister_chrdev_region(dev_t first, unsigned int count);

调用unregister_chrdev_region 的通常位置是在模块的清理函数中。

The usual place to call unregister_chrdev_region would be in your module's cleanup function.

上述函数分配设备编号供驱动程序使用,但它们不会告诉内核您实际将如何处理这些编号。在用户空间程序可以访问这些设备号之一之前,您的驱动程序需要将它们连接到实现设备操作的内部函数。我们将很快描述这种连接是如何完成的,但首先有一些必要的题外话需要注意。

The above functions allocate device numbers for your driver's use, but they do not tell the kernel anything about what you will actually do with those numbers. Before a user-space program can access one of those device numbers, your driver needs to connect them to its internal functions that implement the device's operations. We will describe how this connection is accomplished shortly, but there are a couple of necessary digressions to take care of first.

主设备号的动态分配

Dynamic Allocation of Major Numbers

一些主设备号被静态地分配给了最常见的设备。这些设备的列表可以在内核源码树的 Documentation/devices.txt 中找到。然而,已有静态编号恰好可供您的新驱动程序使用的可能性很小,而且新的静态编号也不再分配了。因此,作为驱动程序编写者,您有两种选择:可以简单地挑选一个看起来未被使用的编号,也可以以动态方式分配主设备号。只要您的驱动程序的唯一用户是您自己,随便挑一个号也许行得通;而一旦您的驱动程序得到更广泛的部署,随机挑选的主设备号就会导致冲突和麻烦。

Some major device numbers are statically assigned to the most common devices. A list of those devices can be found in Documentation/devices.txt within the kernel source tree. The chances of a static number having already been assigned for the use of your new driver are small, however, and new numbers are not being assigned. So, as a driver writer, you have a choice: you can simply pick a number that appears to be unused, or you can allocate major numbers in a dynamic manner. Picking a number may work as long as the only user of your driver is you; once your driver is more widely deployed, a randomly picked major number will lead to conflicts and trouble.

因此,对于新的驱动程序,我们强烈建议您使用动态分配来获取主设备号,而不是从当前空闲的编号中随机挑选一个。换句话说,您的驱动程序几乎肯定应该使用 alloc_chrdev_region,而不是 register_chrdev_region。

Thus, for new drivers, we strongly suggest that you use dynamic allocation to obtain your major device number, rather than choosing a number randomly from the ones that are currently free. In other words, your drivers should almost certainly be using alloc_chrdev_region rather than register_chrdev_region.

动态分配的缺点是您无法提前创建设备节点,因为分配给您的模块的主设备号会有所不同。对于驱动程序的正常使用来说,这几乎不是问题,因为一旦分配了编号,您就可以从 /proc/devices中读取它。[ 1 ]

The disadvantage of dynamic assignment is that you can't create the device nodes in advance, because the major number assigned to your module will vary. For normal use of the driver, this is hardly a problem, because once the number has been assigned, you can read it from /proc/devices.[1]

因此,要加载一个使用动态主设备号的驱动程序,对 insmod 的调用可以替换为一个简单的脚本:它在调用 insmod 之后读取 /proc/devices,以便创建相应的特殊文件。

To load a driver using a dynamic major number, therefore, the invocation of insmod can be replaced by a simple script that, after calling insmod, reads /proc/devices in order to create the special file(s).

典型的/proc/devices文件如下所示:

A typical /proc/devices file looks like the following:

Character devices:
 1 mem
 2 pty
 3 ttyp
 4 ttyS
 6 lp
 7 vcs
 10 misc
 13 input
 14 sound
 21 sg
180 usb

Block devices:
 2 fd
 8 sd
 11 sr
 65 sd
 66 sd

因此,加载一个已被分配动态编号的模块的脚本,可以借助 awk 之类的工具从 /proc/devices 中检索信息,以便在 /dev 中创建文件。

The script to load a module that has been assigned a dynamic number can, therefore, be written using a tool such as awk to retrieve information from /proc/devices in order to create the files in /dev.
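这类脚本的核心,就是那行用 awk 从 /proc/devices 中提取主设备号的命令。下面用一段内嵌的示例输入代替真实的 /proc/devices 来演示这一提取过程;其中的模块名 "scull" 和编号 254 只是为演示而假设的值。

The heart of such a script is the single awk command that extracts the major number from /proc/devices. The snippet below demonstrates that extraction against an inline sample standing in for the real file; the module name "scull" and the number 254 are assumed values for illustration only.

```shell
#!/bin/sh
# A stand-in for what /proc/devices might contain (254 is made up).
sample="Character devices:
  1 mem
254 scull

Block devices:
  8 sd"

module="scull"
# Print field 1 (the major number) of the line whose field 2 matches the
# module name; awk's -v avoids the quoting gymnastics of inlining "$module".
major=$(printf '%s\n' "$sample" | awk -v m="$module" '$2 == m {print $1}')

echo "major=$major"
```

在真实系统上,脚本(如下面的 scull_load)直接以 /proc/devices 为 awk 的输入。On a real system, the script feeds /proc/devices itself to awk, as scull_load below does.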

以下脚本scull_load是scull发行版的一部分 。以模块形式分发的驱动程序的用户可以从系统的rc.local文件中调用此类脚本,或者在需要该模块时手动调用它。

The following script, scull_load, is part of the scull distribution. The user of a driver that is distributed in the form of a module can invoke such a script from the system's rc.local file or call it manually whenever the module is needed.

#!/bin/sh
module="scull"
device="scull"
mode="664"

# invoke insmod with all arguments we got
# and use a pathname, as newer modutils don't look in . by default
/sbin/insmod ./$module.ko $* || exit 1

# remove stale nodes
rm -f /dev/${device}[0-3]

major=$(awk "\$2==\"$module\" {print \$1}" /proc/devices)

mknod /dev/${device}0 c $major 0
mknod /dev/${device}1 c $major 1
mknod /dev/${device}2 c $major 2
mknod /dev/${device}3 c $major 3

# give appropriate group/permissions, and change the group.
# Not all distributions have staff, some have "wheel" instead.
group="staff"
grep -q '^staff:' /etc/group || group="wheel"

chgrp $group /dev/${device}[0-3]
chmod $mode  /dev/${device}[0-3]

通过重新定义变量和调整mknod行,可以将该脚本改编为另一个驱动程序。刚刚显示的脚本创建了四个设备,因为scull源中的默认值是四个。

The script can be adapted for another driver by redefining the variables and adjusting the mknod lines. The script just shown creates four devices because four is the default in the scull sources.

脚本的最后几行可能显得晦涩:为什么要更改设备的组和模式?原因是该脚本必须由超级用户运行,因此新创建的特殊文件归 root 所有。默认的权限位使得只有 root 具有写访问权限,而任何人都可以读。通常,设备节点需要不同的访问策略,因此必须以某种方式更改访问权限。我们脚本中的默认做法是把访问权授予某一组用户,但您的需求可能有所不同。在第 6 章的 6.6 节中,sculluid 的代码演示了驱动程序如何强制执行自己的设备访问授权方式。

The last few lines of the script may seem obscure: why change the group and mode of a device? The reason is that the script must be run by the superuser, so newly created special files are owned by root. The permission bits default so that only root has write access, while anyone can get read access. Normally, a device node requires a different access policy, so in some way or another access rights must be changed. The default in our script is to give access to a group of users, but your needs may vary. In Section 6.6 in Chapter 6, the code for sculluid demonstrates how the driver can enforce its own kind of authorization for device access.

scull_unload脚本也可用于清理/dev目录并删除模块。

A scull_unload script is also available to clean up the /dev directory and remove the module.

作为使用一对加载/卸载脚本的替代方法,您可以编写一个 init 脚本,放在您的发行版用于此类脚本的目录中。[2] 作为 scull 源代码的一部分,我们提供了一个相当完整且可配置的 init 脚本示例,名为 scull.init;它接受常规参数 start、stop 和 restart,并同时承担 scull_load 和 scull_unload 两者的角色。

As an alternative to using a pair of scripts for loading and unloading, you could write an init script, ready to be placed in the directory your distribution uses for these scripts.[2] As part of the scull source, we offer a fairly complete and configurable example of an init script, called scull.init; it accepts the conventional arguments—start, stop, and restart—and performs the role of both scull_load and scull_unload.

如果反复创建和销毁 /dev 节点听起来小题大做,这里有一个有用的变通办法。如果您只加载和卸载单个驱动程序,那么在第一次用脚本创建特殊文件之后,就可以直接使用 rmmod 和 insmod:动态编号并不是随机化的,[3] 只要您不加载其他(动态)模块,就可以指望每次选中相同的编号。在开发过程中,避免冗长的脚本很有用。但显然,这个技巧无法同时扩展到多个驱动程序。

If repeatedly creating and destroying /dev nodes sounds like overkill, there is a useful workaround. If you are loading and unloading only a single driver, you can just use rmmod and insmod after the first time you create the special files with your script: dynamic numbers are not randomized,[3] and you can count on the same number being chosen each time if you don't load any other (dynamic) modules. Avoiding lengthy scripts is useful during development. But this trick, clearly, doesn't scale to more than one driver at a time.

我们认为,分配主设备号的最佳方法是:默认采用动态分配,同时给自己留出在加载时甚至编译时指定主设备号的选项。scull 的实现就是这样工作的:它使用一个全局变量 scull_major 来保存所选的编号(还有一个 scull_minor 用于次设备号)。该变量被初始化为 SCULL_MAJOR,后者在 scull.h 中定义。发行源码中 SCULL_MAJOR 的默认值为 0,表示"使用动态分配"。用户可以接受默认值,也可以选择特定的主设备号:既可以在编译前修改这个宏,也可以在 insmod 命令行上为 scull_major 指定一个值。最后,通过使用 scull_load 脚本,用户可以在 scull_load 的命令行上把参数传递给 insmod。[4]

The best way to assign major numbers, in our opinion, is by defaulting to dynamic allocation while leaving yourself the option of specifying the major number at load time, or even at compile time. The scull implementation works in this way; it uses a global variable, scull_major, to hold the chosen number (there is also a scull_minor for the minor number). The variable is initialized to SCULL_MAJOR, defined in scull.h. The default value of SCULL_MAJOR in the distributed source is 0, which means "use dynamic assignment." The user can accept the default or choose a particular major number, either by modifying the macro before compiling or by specifying a value for scull_major on the insmod command line. Finally, by using the scull_load script, the user can pass arguments to insmod on scull_load 's command line.[4]

这是我们在scull源代码中用于获取主设备号的代码:

Here's the code we use in scull 's source to get a major number:

if (scull_major) {
    dev = MKDEV(scull_major, scull_minor);
    result = register_chrdev_region(dev, scull_nr_devs, "scull");
} else {
    result = alloc_chrdev_region(&dev, scull_minor, scull_nr_devs,
            "scull");
    scull_major = MAJOR(dev);
}
if (result < 0) {
    printk(KERN_WARNING "scull: can't get major %d\n", scull_major);
    return result;
}

本书使用的几乎所有示例驱动程序,都使用类似的代码来分配它们的主设备号。

Almost all of the sample drivers used in this book use similar code for their major number assignment.

一些重要的数据结构

Some Important Data Structures

可以想象,设备号的注册只是驱动程序代码必须完成的众多任务中的第一个。我们很快会考察其他重要的驱动程序组件,但在此之前还需要一段题外话。大多数基本的驱动程序操作都涉及三个重要的内核数据结构,分别是 file_operations、file 和 inode。要做任何有意思的事情,都需要对这些结构有基本的了解,因此在深入讨论如何实现基本驱动程序操作之前,我们先快速浏览一下这三个结构。

As you can imagine, device number registration is just the first of many tasks that driver code must carry out. We will soon look at other important driver components, but one other digression is needed first. Most of the fundamental driver operations involve three important kernel data structures, called file_operations, file, and inode. A basic familiarity with these structures is required to be able to do much of anything interesting, so we will now take a quick look at each of them before getting into the details of how to implement the fundamental driver operations.

文件操作

File Operations

到目前为止,我们已经预留了一些设备号供我们使用,但尚未把驱动程序的任何操作连接到这些编号上。file_operations 结构就是字符驱动程序建立这种连接的方式。该结构在 <linux/fs.h> 中定义,是一个函数指针的集合。每个打开的文件(在内核内部由一个 file 结构表示,我们稍后会研究它)都与它自己的一组函数相关联(通过包含一个名为 f_op 的、指向 file_operations 结构的字段)。这些操作主要负责实现系统调用,因此被命名为 open、read 等等。借用面向对象编程的术语(对象声明作用于自身的操作),我们可以把文件视为一个"对象",把作用于它的函数视为它的"方法"。这是我们在 Linux 内核中看到的面向对象编程的第一个迹象,后面的章节中还会看到更多。

So far, we have reserved some device numbers for our use, but we have not yet connected any of our driver's operations to those numbers. The file_operations structure is how a char driver sets up this connection. The structure, defined in <linux/fs.h>, is a collection of function pointers. Each open file (represented internally by a file structure, which we will examine shortly) is associated with its own set of functions (by including a field called f_op that points to a file_operations structure). The operations are mostly in charge of implementing the system calls and are therefore, named open, read, and so on. We can consider the file to be an "object" and the functions operating on it to be its "methods," using object-oriented programming terminology to denote actions declared by an object to act on itself. This is the first sign of object-oriented programming we see in the Linux kernel, and we'll see more in later chapters.

按照惯例,file_operations 结构体或指向它的指针被称为 fops(或其某种变体)。结构中的每个字段必须指向驱动程序中实现特定操作的函数,对于不支持的操作则置为 NULL。当指定 NULL 指针时,内核的确切行为对每个函数各不相同,如本节稍后的列表所示。

Conventionally, a file_operations structure or a pointer to one is called fops (or some variation thereof ). Each field in the structure must point to the function in the driver that implements a specific operation, or be left NULL for unsupported operations. The exact behavior of the kernel when a NULL pointer is specified is different for each function, as the list later in this section shows.

以下列表介绍了应用程序可以在设备上调用的所有操作。我们试图保持列表简短,以便它可以用作参考,仅总结每个操作以及NULL使用指针时的默认内核行为。

The following list introduces all the operations that an application can invoke on a device. We've tried to keep the list brief so it can be used as a reference, merely summarizing each operation and the default kernel behavior when a NULL pointer is used.

当您通读 file_operations 的方法列表时,会注意到许多参数都包含字符串 _ _user。这个注解是一种文档形式,表明该指针是一个用户空间地址,不能被直接解引用。对于正常的编译,_ _user 没有任何影响,但外部检查软件可以利用它来发现对用户空间地址的误用。

As you read through the list of file_operations methods, you will note that a number of parameters include the string _ _user. This annotation is a form of documentation, noting that a pointer is a user-space address that cannot be directly dereferenced. For normal compilation, _ _user has no effect, but it can be used by external checking software to find misuse of user-space addresses.

本章的其余部分在描述了其他一些重要的数据结构之后,解释了最重要操作的作用并提供了提示、警告和真实的代码示例。我们将对更复杂的操作的讨论推迟到后面的章节,因为我们还没有准备好深入研究内存管理、阻塞操作和异步通知等主题。

The rest of the chapter, after describing some other important data structures, explains the role of the most important operations and offers hints, caveats, and real code examples. We defer discussion of the more complex operations to later chapters, because we aren't ready to dig into topics such as memory management, blocking operations, and asynchronous notification quite yet.

struct module *owner
struct module *owner

file_operations 的第一个字段根本不是一个操作;它是指向"拥有"该结构的模块的指针。该字段用于防止模块在其操作正在被使用时被卸载。几乎在所有情况下,它都被简单地初始化为 THIS_MODULE,这是一个在 <linux/module.h> 中定义的宏。

The first file_operations field is not an operation at all; it is a pointer to the module that "owns" the structure. This field is used to prevent the module from being unloaded while its operations are in use. Almost all the time, it is simply initialized to THIS_MODULE, a macro defined in <linux/module.h>.

loff_t (*llseek) (struct file *, loff_t, int);
loff_t (*llseek) (struct file *, loff_t, int);

llseek 方法用于更改文件中当前的读/写位置,新位置作为(正的)返回值返回。loff_t 参数是一个"长偏移量",即使在 32 位平台上也至少有 64 位宽。错误通过负的返回值来表示。如果此函数指针为 NULL,seek 调用将以潜在不可预测的方式修改 file 结构(在 3.3.2 节中描述)中的位置计数器。

The llseek method is used to change the current read/write position in a file, and the new position is returned as a (positive) return value. The loff_t parameter is a "long offset" and is at least 64 bits wide even on 32-bit platforms. Errors are signaled by a negative return value. If this function pointer is NULL, seek calls will modify the position counter in the file structure (described in Section 3.3.2) in potentially unpredictable ways.

ssize_t (*read) (struct file *, char _ _user *, size_t, loff_t *);
ssize_t (*read) (struct file *, char _ _user *, size_t, loff_t *);

用于从设备检索数据。此位置上的空指针会导致 read 系统调用失败并返回 -EINVAL("Invalid argument",无效参数)。非负的返回值表示成功读取的字节数(返回值是"有符号的 size"类型,通常是目标平台的本机整数类型)。

Used to retrieve data from the device. A null pointer in this position causes the read system call to fail with -EINVAL ("Invalid argument"). A nonnegative return value represents the number of bytes successfully read (the return value is a "signed size" type, usually the native integer type for the target platform).

ssize_t (*aio_read)(struct kiocb *, char _ _user *, size_t, loff_t);
ssize_t (*aio_read)(struct kiocb *, char _ _user *, size_t, loff_t);

启动一次异步读取,即在函数返回之前可能尚未完成的读取操作。如果此方法为 NULL,所有操作都将改由 read(同步地)处理。

Initiates an asynchronous read—a read operation that might not complete before the function returns. If this method is NULL, all operations will be processed (synchronously) by read instead.

ssize_t (*write) (struct file *, const char _ _user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char _ _user *, size_t, loff_t *);

向设备发送数据。如果为 NULL,则向调用 write 系统调用的程序返回 -EINVAL。返回值如果非负,则表示成功写入的字节数。

Sends data to the device. If NULL, -EINVAL is returned to the program calling the write system call. The return value, if nonnegative, represents the number of bytes successfully written.

ssize_t (*aio_write)(struct kiocb *, const char _ _user *, size_t, loff_t *);
ssize_t (*aio_write)(struct kiocb *, const char _ _user *, size_t, loff_t *);

在设备上启动异步写入操作。

Initiates an asynchronous write operation on the device.

int (*readdir) (struct file *, void *, filldir_t);
int (*readdir) (struct file *, void *, filldir_t);

对于设备文件,该字段应为 NULL;它用于读取目录,只对文件系统有用。

This field should be NULL for device files; it is used for reading directories and is useful only for filesystems.

unsigned int (*poll) (struct file *, struct poll_table_struct *);
unsigned int (*poll) (struct file *, struct poll_table_struct *);

poll 方法是三个系统调用的后端:poll、epoll 和 select,这三者都用于查询对一个或多个文件描述符的读或写是否会阻塞。poll 方法应返回一个位掩码,指示非阻塞的读或写是否可行,并且还可能向内核提供信息,用于让调用进程休眠,直到 I/O 变为可能。如果驱动程序把它的 poll 方法置为 NULL,则认为该设备的读和写都不会阻塞。

The poll method is the back end of three system calls: poll, epoll, and select, all of which are used to query whether a read or write to one or more file descriptors would block. The poll method should return a bit mask indicating whether non-blocking reads or writes are possible, and, possibly, provide the kernel with information that can be used to put the calling process to sleep until I/O becomes possible. If a driver leaves its poll method NULL, the device is assumed to be both readable and writable without blocking.

int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);

ioctl 系统调用提供了一种发出设备特定命令的途径(例如格式化软盘的某个磁道,这既不是读也不是写)。此外,有少数几个 ioctl 命令可由内核直接识别,而无需查询 fops 表。如果设备不提供 ioctl 方法,那么对任何未预定义的请求,该系统调用都会返回错误(-ENOTTY,"No such ioctl for device",设备没有这样的 ioctl)。

The ioctl system call offers a way to issue device-specific commands (such as formatting a track of a floppy disk, which is neither reading nor writing). Additionally, a few ioctl commands are recognized by the kernel without referring to the fops table. If the device doesn't provide an ioctl method, the system call returns an error for any request that isn't predefined (-ENOTTY, "No such ioctl for device").

int (*mmap) (struct file *, struct vm_area_struct *);
int (*mmap) (struct file *, struct vm_area_struct *);

mmap 用于请求把设备内存映射到进程的地址空间。如果此方法为 NULL,mmap 系统调用将返回 -ENODEV。

mmap is used to request a mapping of device memory to a process's address space. If this method is NULL, the mmap system call returns -ENODEV.

int (*open) (struct inode *, struct file *);
int (*open) (struct inode *, struct file *);

尽管这始终是对设备文件执行的第一个操作,但驱动程序不需要声明相应的方法。如果此项为NULL,则打开设备始终会成功,但不会通知您的驱动程序。

Though this is always the first operation performed on the device file, the driver is not required to declare a corresponding method. If this entry is NULL, opening the device always succeeds, but your driver isn't notified.

int (*flush) (struct file *);
int (*flush) (struct file *);

flush 操作在进程关闭它的设备文件描述符副本时被调用;它应当执行(并等待)设备上所有未完成的操作。不要把它与用户程序请求的 fsync 操作相混淆。目前,只有极少数驱动程序使用 flush;例如,SCSI 磁带驱动程序用它来确保在设备关闭之前,所有写入的数据都已写到磁带上。如果 flush 为 NULL,内核就简单地忽略用户应用程序的请求。

The flush operation is invoked when a process closes its copy of a file descriptor for a device; it should execute (and wait for) any outstanding operations on the device. This must not be confused with the fsync operation requested by user programs. Currently, flush is used in very few drivers; the SCSI tape driver uses it, for example, to ensure that all data written makes it to the tape before the device is closed. If flush is NULL, the kernel simply ignores the user application request.

int (*release) (struct inode *, struct file *);
int (*release) (struct inode *, struct file *);

当 file 结构被释放时,将调用此操作。与 open 一样,release 也可以为 NULL。[5]

This operation is invoked when the file structure is being released. Like open, release can be NULL.[5]

int (*fsync) (struct file *, struct dentry *, int);
int (*fsync) (struct file *, struct dentry *, int);

此方法是fsync系统调用的后端,用户调用该方法来刷新任何挂起的数据。如果该指针为NULL,则系统调用返回-EINVAL

This method is the back end of the fsync system call, which a user calls to flush any pending data. If this pointer is NULL, the system call returns -EINVAL.

int (*aio_fsync)(struct kiocb *, int);
int (*aio_fsync)(struct kiocb *, int);

这是fsync方法的异步版本 。

This is the asynchronous version of the fsync method.

int (*fasync) (int, struct file *, int);
int (*fasync) (int, struct file *, int);

该操作用于把设备的 FASYNC 标志的变化通知给设备。异步通知是一个高级主题,将在第 6 章中描述。如果驱动程序不支持异步通知,该字段可以为 NULL。

This operation is used to notify the device of a change in its FASYNC flag. Asynchronous notification is an advanced topic and is described in Chapter 6. The field can be NULL if the driver doesn't support asynchronous notification.

int (*lock) (struct file *, int, struct file_lock *);
int (*lock) (struct file *, int, struct file_lock *);

The lock method is used to implement file locking; locking is an indispensable feature for regular files but is almost never implemented by device drivers.

ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);

ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);

These methods implement scatter/gather read and write operations. Applications occasionally need to do a single read or write operation involving multiple memory areas; these system calls allow them to do so without forcing extra copy operations on the data. If these function pointers are left NULL, the read and write methods are called (perhaps more than once) instead.

ssize_t (*sendfile)(struct file *, loff_t *, size_t, read_actor_t, void *);

This method implements the read side of the sendfile system call, which moves the data from one file descriptor to another with a minimum of copying. It is used, for example, by a web server that needs to send the contents of a file out a network connection. Device drivers usually leave sendfile NULL.

ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *,

int);

sendpage is the other half of sendfile; it is called by the kernel to send data, one page at a time, to the corresponding file. Device drivers do not usually implement sendpage.

unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned

long, unsigned long, unsigned long);

The purpose of this method is to find a suitable location in the process's address space to map in a memory segment on the underlying device. This task is normally performed by the memory management code; this method exists to allow drivers to enforce any alignment requirements a particular device may have. Most drivers can leave this method NULL.

int (*check_flags)(int)

This method allows a module to check the flags passed to an fcntl(F_SETFL...) call.

int (*dir_notify)(struct file *, unsigned long);

This method is invoked when an application uses fcntl to request directory change notifications. It is useful only to filesystems; drivers need not implement dir_notify.

The scull device driver implements only the most important device methods. Its file_operations structure is initialized as follows:

struct file_operations scull_fops = {
    .owner =    THIS_MODULE,
    .llseek =   scull_llseek,
    .read =     scull_read,
    .write =    scull_write,
    .ioctl =    scull_ioctl,
    .open =     scull_open,
    .release =  scull_release,
};

This declaration uses the standard C tagged structure initialization syntax. This syntax is preferred because it makes drivers more portable across changes in the definitions of the structures and, arguably, makes the code more compact and readable. Tagged initialization allows the reordering of structure members; in some cases, substantial performance improvements have been realized by placing pointers to frequently accessed members in the same hardware cache line.

The file Structure

struct file, defined in <linux/fs.h>, is the second most important data structure used in device drivers. Note that a file has nothing to do with the FILE pointers of user-space programs. A FILE is defined in the C library and never appears in kernel code. A struct file, on the other hand, is a kernel structure that never appears in user programs.

The file structure represents an open file . (It is not specific to device drivers; every open file in the system has an associated struct file in kernel space.) It is created by the kernel on open and is passed to any function that operates on the file, until the last close. After all instances of the file are closed, the kernel releases the data structure.

In the kernel sources, a pointer to struct file is usually called either file or filp ("file pointer"). We'll consistently call the pointer filp to prevent ambiguities with the structure itself. Thus, file refers to the structure and filp to a pointer to the structure.

The most important fields of struct file are shown here. As in the previous section, the list can be skipped on a first reading. However, later in this chapter, when we face some real C code, we'll discuss the fields in more detail.

mode_t f_mode;

The file mode identifies the file as either readable or writable (or both), by means of the bits FMODE_READ and FMODE_WRITE. You might want to check this field for read/write permission in your open or ioctl function, but you don't need to check permissions for read and write, because the kernel checks before invoking your method. An attempt to read or write when the file has not been opened for that type of access is rejected without the driver even knowing about it.

loff_t f_pos;

The current reading or writing position. loff_t is a 64-bit value on all platforms (long long in gcc terminology). The driver can read this value if it needs to know the current position in the file but should not normally change it; read and write should update a position using the pointer they receive as the last argument instead of acting on filp->f_pos directly. The one exception to this rule is in the llseek method, the purpose of which is to change the file position.

unsigned int f_flags;

These are the file flags, such as O_RDONLY, O_NONBLOCK, and O_SYNC. A driver should check the O_NONBLOCK flag to see if nonblocking operation has been requested (we discuss nonblocking I/O in Section 6.2.3); the other flags are seldom used. In particular, read/write permission should be checked using f_mode rather than f_flags. All the flags are defined in the header <linux/fcntl.h>.

struct file_operations *f_op;

The operations associated with the file. The kernel assigns the pointer as part of its implementation of open and then reads it when it needs to dispatch any operations. The value in filp->f_op is never saved by the kernel for later reference; this means that you can change the file operations associated with your file, and the new methods will be effective after you return to the caller. For example, the code for open associated with major number 1 (/dev/null, /dev/zero, and so on) substitutes the operations in filp->f_op depending on the minor number being opened. This practice allows the implementation of several behaviors under the same major number without introducing overhead at each system call. The ability to replace the file operations is the kernel equivalent of "method overriding" in object-oriented programming.

void *private_data;

The open system call sets this pointer to NULL before calling the open method for the driver. You are free to make its own use of the field or to ignore it; you can use the field to point to allocated data, but then you must remember to free that memory in the release method before the file structure is destroyed by the kernel. private_data is a useful resource for preserving state information across system calls and is used by most of our sample modules.

struct dentry *f_dentry;

The directory entry (dentry) structure associated with the file. Device driver writers normally need not concern themselves with dentry structures, other than to access the inode structure as filp->f_dentry->d_inode.

The real structure has a few more fields, but they aren't useful to device drivers. We can safely ignore those fields, because drivers never create file structures; they only access structures created elsewhere.

The inode Structure

The inode structure is used by the kernel internally to represent files. Therefore, it is different from the file structure that represents an open file descriptor. There can be numerous file structures representing multiple open descriptors on a single file, but they all point to a single inode structure.

The inode structure contains a great deal of information about the file. As a general rule, only two fields of this structure are of interest for writing driver code:

dev_t i_rdev;

For inodes that represent device files, this field contains the actual device number.

struct cdev *i_cdev;

struct cdev is the kernel's internal structure that represents char devices; this field contains a pointer to that structure when the inode refers to a char device file.

The type of i_rdev changed over the course of the 2.5 development series, breaking a lot of drivers. As a way of encouraging more portable programming, the kernel developers have added two macros that can be used to obtain the major and minor number from an inode:

unsigned int iminor(struct inode *inode);
unsigned int imajor(struct inode *inode);

In the interest of not being caught by the next change, these macros should be used instead of manipulating i_rdev directly.

Char Device Registration

As we mentioned, the kernel uses structures of type struct cdev to represent char devices internally. Before the kernel invokes your device's operations, you must allocate and register one or more of these structures.[6] To do so, your code should include <linux/cdev.h>, where the structure and its associated helper functions are defined.

There are two ways of allocating and initializing one of these structures. If you wish to obtain a standalone cdev structure at runtime, you may do so with code such as:

struct cdev *my_cdev = cdev_alloc(  );
my_cdev->ops = &my_fops;

Chances are, however, that you will want to embed the cdev structure within a device-specific structure of your own; that is what scull does. In that case, you should initialize the structure that you have already allocated with:

void cdev_init(struct cdev *cdev, struct file_operations *fops);

Either way, there is one other struct cdev field that you need to initialize. Like the file_operations structure, struct cdev has an owner field that should be set to THIS_MODULE.

Once the cdev structure is set up, the final step is to tell the kernel about it with a call to:

int cdev_add(struct cdev *dev, dev_t num, unsigned int count);

这里,devcdev结构体,num是该设备响应的第一个设备号,count是应该与该设备关联的设备号的数量。通常count是一个,但在某些情况下,有多个设备编号对应于一个特定设备是有意义的。例如,考虑一下 SCSI 磁带驱动程序,它允许用户空间通过为每个物理设备分配多个次要编号来选择操作模式(例如密度)。

Here, dev is the cdev structure, num is the first device number to which this device responds, and count is the number of device numbers that should be associated with the device. Often count is one, but there are situations where it makes sense to have more than one device number correspond to a specific device. Consider, for example, the SCSI tape driver, which allows user space to select operating modes (such as density) by assigning multiple minor numbers to each physical device.

There are a couple of important things to keep in mind when using cdev_add. The first is that this call can fail. If it returns a negative error code, your device has not been added to the system. It almost always succeeds, however, and that brings up the other point: as soon as cdev_add returns, your device is "live" and its operations can be called by the kernel. You should not call cdev_add until your driver is completely ready to handle operations on the device.

To remove a char device from the system, call:

void cdev_del(struct cdev *dev);

Clearly, you should not access the cdev structure after passing it to cdev_del.

Device Registration in scull

Internally, scull represents each device with a structure of type struct scull_dev. This structure is defined as:

struct scull_dev {
    struct scull_qset *data;  /* Pointer to first quantum set */
    int quantum;              /* the current quantum size */
    int qset;                 /* the current array size */
    unsigned long size;       /* amount of data stored here */
    unsigned int access_key;  /* used by sculluid and scullpriv */
    struct semaphore sem;     /* mutual exclusion semaphore     */
    struct cdev cdev;     /* Char device structure      */
};

We discuss the various fields in this structure as we come to them, but for now, we call attention to cdev, the struct cdev that interfaces our device to the kernel. This structure must be initialized and added to the system as described above; the scull code that handles this task is:

static void scull_setup_cdev(struct scull_dev *dev, int index)
{
    int err, devno = MKDEV(scull_major, scull_minor + index);
    
    cdev_init(&dev->cdev, &scull_fops);
    dev->cdev.owner = THIS_MODULE;
    dev->cdev.ops = &scull_fops;
    err = cdev_add (&dev->cdev, devno, 1);
    /* Fail gracefully if need be */
    if (err)
    printk(KERN_NOTICE "Error %d adding scull%d", err, index);
}

Since the cdev structure is embedded within struct scull_dev, cdev_init must be called to perform the initialization of that structure.

The Older Way

If you dig through much driver code in the 2.6 kernel, you may notice that quite a few char drivers do not use the cdev interface that we have just described. What you are seeing is older code that has not yet been upgraded to the 2.6 interface. Since that code works as it is, this upgrade may not happen for a long time. For completeness, we describe the older char device registration interface, but new code should not use it; this mechanism will likely go away in a future kernel.

The classic way to register a char device driver is with:

int register_chrdev(unsigned int major, const char *name,
                    struct file_operations *fops);

Here, major is the major number of interest, name is the name of the driver (it appears in /proc/devices), and fops is the default file_operations structure. A call to register_chrdev registers minor numbers 0-255 for the given major, and sets up a default cdev structure for each. Drivers using this interface must be prepared to handle open calls on all 256 minor numbers (whether they correspond to real devices or not), and they cannot use major or minor numbers greater than 255.

If you use register_chrdev, the proper function to remove your device(s) from the system is:

int unregister_chrdev(unsigned int major, const char *name);

major and name must be the same as those passed to register_chrdev, or the call will fail.

open and release

Now that we've taken a quick look at the fields, we start using them in real scull functions.

The open Method

The open method is provided for a driver to do any initialization in preparation for later operations. In most drivers, open should perform the following tasks:

  • Check for device-specific errors (such as device-not-ready or similar hardware problems)

  • Initialize the device if it is being opened for the first time

  • Update the f_op pointer, if necessary

  • Allocate and fill any data structure to be put in filp->private_data

The first order of business, however, is usually to identify which device is being opened. Remember that the prototype for the open method is:

int (*open)(struct inode *inode, struct file *filp);

The inode argument has the information we need in the form of its i_cdev field, which contains the cdev structure we set up before. The only problem is that we do not normally want the cdev structure itself, we want the scull_dev structure that contains that cdev structure. The C language lets programmers play all sorts of tricks to make that kind of conversion; programming such tricks is error prone, however, and leads to code that is difficult for others to read and understand. Fortunately, in this case, the kernel hackers have done the tricky stuff for us, in the form of the container_of macro, defined in <linux/kernel.h>:

container_of(pointer, container_type, container_field);

This macro takes a pointer to a field of type container_field, within a structure of type container_type, and returns a pointer to the containing structure. In scull_open, this macro is used to find the appropriate device structure:

struct scull_dev *dev; /* device information */

dev = container_of(inode->i_cdev, struct scull_dev, cdev);
filp->private_data = dev; /* for other methods */

Once it has found the scull_dev structure, scull stores a pointer to it in the private_data field of the file structure for easier access in the future.

The other way to identify the device being opened is to look at the minor number stored in the inode structure. If you register your device with register_chrdev, you must use this technique. Be sure to use iminor to obtain the minor number from the inode structure, and make sure that it corresponds to a device that your driver is actually prepared to handle.

The (slightly simplified) code for scull_open is:

int scull_open(struct inode *inode, struct file *filp)
{
    struct scull_dev *dev; /* device information */

    dev = container_of(inode->i_cdev, struct scull_dev, cdev);
    filp->private_data = dev; /* for other methods */

    /* now trim to 0 the length of the device if open was write-only */
    if ( (filp->f_flags & O_ACCMODE) == O_WRONLY) {
        scull_trim(dev); /* ignore errors */
    }
    return 0;          /* success */
}

The code looks pretty sparse, because it doesn't do any particular device handling when open is called. It doesn't need to, because the scull device is global and persistent by design. Specifically, there's no action such as "initializing the device on first open," because we don't keep an open count for sculls.

The only real operation performed on the device is truncating it to a length of 0 when the device is opened for writing. This is performed because, by design, overwriting a scull device with a shorter file results in a shorter device data area. This is similar to the way opening a regular file for writing truncates it to zero length. The operation does nothing if the device is opened for reading.

We'll see later how a real initialization works when we look at the code for the other scull personalities.

The release Method

The role of the release method is the reverse of open. Sometimes you'll find that the method implementation is called device _close instead of device _release. Either way, the device method should perform the following tasks:

  • Deallocate anything that open allocated in filp->private_data

  • Shut down the device on last close

The basic form of scull has no hardware to shut down, so the code required is minimal:[7]

int scull_release(struct inode *inode, struct file *filp)
{
    return 0;
}

You may be wondering what happens when a device file is closed more times than it is opened. After all, the dup and fork system calls create copies of open files without calling open; each of those copies is then closed at program termination. For example, most programs don't open their stdin file (or device), but all of them end up closing it. How does a driver know when an open device file has really been closed?

The answer is simple: not every close system call causes the release method to be invoked. Only the calls that actually release the device data structure invoke the method—hence its name. The kernel keeps a counter of how many times a file structure is being used. Neither fork nor dup creates a new file structure (only open does that); they just increment the counter in the existing structure. The close system call executes the release method only when the counter for the file structure drops to 0, which happens when the structure is destroyed. This relationship between the release method and the close system call guarantees that your driver sees only one release call for each open.

Note that the flush method is called every time an application calls close. However, very few drivers implement flush, because usually there's nothing to perform at close time unless release is involved.

As you may imagine, the previous discussion applies even when the application terminates without explicitly closing its open files: the kernel automatically closes any file at process exit time by internally using the close system call.

scull's Memory Usage

Before introducing the read and write operations, we'd better look at how and why scull performs memory allocation. "How" is needed to thoroughly understand the code, and "why" demonstrates the kind of choices a driver writer needs to make, although scull is definitely not typical as a device.

This section deals only with the memory allocation policy in scull and doesn't show the hardware management skills you need to write real drivers. These skills are introduced in Chapter 9 and Chapter 10. Therefore, you can skip this section if you're not interested in understanding the inner workings of the memory-oriented scull driver.

The region of memory used by scull, also called a device, is variable in length. The more you write, the more it grows; trimming is performed by overwriting the device with a shorter file.

The scull driver introduces two core functions used to manage memory in the Linux kernel. These functions, defined in <linux/slab.h>, are:

void *kmalloc(size_t size, int flags);
void kfree(void *ptr);

A call to kmalloc attempts to allocate size bytes of memory; the return value is a pointer to that memory or NULL if the allocation fails. The flags argument is used to describe how the memory should be allocated; we examine those flags in detail in Chapter 8. For now, we always use GFP_KERNEL. Allocated memory should be freed with kfree. You should never pass anything to kfree that was not obtained from kmalloc. It is, however, legal to pass a NULL pointer to kfree.

kmalloc is not the most efficient way to allocate large areas of memory (see Chapter 8), so the implementation chosen for scull is not a particularly smart one. The source code for a smart implementation would be more difficult to read, and the aim of this section is to show read and write, not memory management. That's why the code just uses kmalloc and kfree without resorting to allocation of whole pages, although that approach would be more efficient.

On the flip side, we didn't want to limit the size of the "device" area, for both a philosophical reason and a practical one. Philosophically, it's always a bad idea to put arbitrary limits on data items being managed. Practically, scull can be used to temporarily eat up your system's memory in order to run tests under low-memory conditions. Running such tests might help you understand the system's internals. You can use the command cp /dev/zero /dev/scull0 to eat all the real RAM with scull, and you can use the dd utility to choose how much data is copied to the scull device.

scull中,每个设备都是一个链表 指针,每个指针都指向一个scull_dev结构。默认情况下,每个此类结构可以通过中间指针数组引用最多 400 万字节。发布的源代码使用了一个由 1000 个指针组成的数组,指向 4000 字节的区域。我们将每个存储区域称为一个量子,将数组(或其长度)称为一个量子集 scull设备 及其内存区域如图 3-1所示。

In scull, each device is a linked list of pointers, each of which points to a scull_dev structure. Each such structure can refer, by default, to at most four million bytes, through an array of intermediate pointers. The released source uses an array of 1000 pointers to areas of 4000 bytes. We call each memory area a quantum and the array (or its length) a quantum set . A scull device and its memory areas are shown in Figure 3-1.

scull 设备的布局

图 3-1。scull 设备的布局

Figure 3-1. The layout of a scull device

所选择的数字使得在scull中写入单个字节会消耗 8000 或 12,000 字节的内存:量子占 4000 字节,量子集占 4000 或 8000 字节(取决于指针在目标平台上以 32 位还是 64 位表示)。相反,如果写入大量数据,链表的开销也不会太糟糕:每四兆字节的数据只有一个列表元素,设备的最大大小仅受计算机内存大小的限制。

The chosen numbers are such that writing a single byte in scull consumes 8000 or 12,000 bytes of memory: 4000 for the quantum and 4000 or 8000 for the quantum set (according to whether a pointer is represented in 32 bits or 64 bits on the target platform). If, instead, you write a huge amount of data, the overhead of the linked list is not too bad. There is only one list element for every four megabytes of data, and the maximum size of the device is limited by the computer's memory size.
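This arithmetic is easy to check outside the kernel. The following user-space sketch (a hypothetical helper, not part of scull) computes the cost of writing the first byte under the book's default sizes:

```c
#include <assert.h>

/* Bytes consumed by writing the first byte into an empty scull device:
 * one quantum plus the quantum-set pointer array. Hypothetical helper
 * for illustration only; not part of the driver itself. */
long scull_first_byte_cost(long quantum, long qset, long ptr_size)
{
    return quantum + qset * ptr_size;
}
```

With the defaults, the cost is 4000 + 1000 × 4 = 8000 bytes with 32-bit pointers, and 4000 + 1000 × 8 = 12,000 bytes with 64-bit pointers.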

为量子和量子集选择适当的值是一个策略问题,而不是机制问题,最佳大小取决于设备的使用方式。因此,scull驱动程序不应强制使用任何特定的量子和量子集大小。在scull中,用户可以通过多种方式更改这些值:在编译时更改scull.h中的宏SCULL_QUANTUMSCULL_QSET,在模块加载时设置整数值scull_quantumscull_qset,或者在运行时使用ioctl更改当前值和默认值。

Choosing the appropriate values for the quantum and the quantum set is a question of policy, rather than mechanism, and the optimal sizes depend on how the device is used. Thus, the scull driver should not force the use of any particular values for the quantum and quantum set sizes. In scull, the user can change the values in charge in several ways: by changing the macros SCULL_QUANTUM and SCULL_QSET in scull.h at compile time, by setting the integer values scull_quantum and scull_qset at module load time, or by changing both the current and default values using ioctl at runtime.

使用宏和整数值来同时允许编译时和加载时配置,让人想起主设备号的选择方式。对于驱动程序中任何任意的或与策略相关的值,我们都使用这种技术。

Using a macro and an integer value to allow both compile-time and load-time configuration is reminiscent of how the major number is selected. We use this technique for whatever value in the driver is arbitrary or related to policy.

剩下的唯一问题是如何选择默认数字。在这种特殊情况下,问题在于在半填充的量子和量子集造成的内存浪费,与量子和量子集较小时发生的分配、释放和指针链接的开销之间找到最佳平衡。此外,还应考虑kmalloc的内部设计。(不过,我们现在不深究这一点;kmalloc的内部结构将在第 8 章中探讨。)默认数字的选择基于这样的假设:测试时可能会向scull写入大量数据,尽管正常使用该设备很可能只会传输几千字节的数据。

The only question left is how the default numbers have been chosen. In this particular case, the problem is finding the best balance between the waste of memory resulting from half-filled quanta and quantum sets and the overhead of allocation, deallocation, and pointer chaining that occurs if quanta and sets are small. Additionally, the internal design of kmalloc should be taken into account. (We won't pursue the point now, though; the innards of kmalloc are explored in Chapter 8.) The choice of default numbers comes from the assumption that massive amounts of data are likely to be written to scull while testing it, although normal use of the device will most likely transfer just a few kilobytes of data.

我们已经看到了在内部代表我们设备的scull_dev结构。该结构的quantumqset字段分别保存设备的量子和量子集大小。然而,实际数据是由另一个结构跟踪的,我们称之为struct scull_qset

We have already seen the scull_dev structure that represents our device internally. That structure's quantum and qset fields hold the device's quantum and quantum set sizes, respectively. The actual data, however, is tracked by a different structure, which we call struct scull_qset :

struct scull_qset {
    void **data;
    struct scull_qset *next;
};

下一个代码片段展示了在实践中如何使用struct scull_devstruct scull_qset来保存数据。函数scull_trim负责释放整个数据区域,并在文件以写入方式打开时由scull_open调用。它只是遍历列表并释放它找到的所有量子和量子集。

The next code fragment shows in practice how struct scull_dev and struct scull_qset are used to hold data. The function scull_trim is in charge of freeing the whole data area and is invoked by scull_open when the file is opened for writing. It simply walks through the list and frees any quantum and quantum set it finds.

int scull_trim(struct scull_dev *dev)
{
    struct scull_qset *next, *dptr;
    int qset = dev->qset;   /* "dev" is not-null */
    int i;

    for (dptr = dev->data; dptr; dptr = next) { /* all the list items */
        if (dptr->data) {
            for (i = 0; i < qset; i++)
                kfree(dptr->data[i]);
            kfree(dptr->data);
            dptr->data = NULL;
        }
        next = dptr->next;
        kfree(dptr);
    }
    dev->size = 0;
    dev->quantum = scull_quantum;
    dev->qset = scull_qset;
    dev->data = NULL;
    return 0;
}

scull_trim也用在模块清理函数中,以便将scull使用的内存返回给系统。

scull_trim is also used in the module cleanup function to return memory used by scull to the system.

读和写

read and write

读取写入方法都执行类似的任务,即将数据复制到应用程序代码或从应用程序代码复制数据。因此,它们的原型非常相似,值得同时介绍:

The read and write methods both perform a similar task, that is, copying data from and to application code. Therefore, their prototypes are pretty similar, and it's worth introducing them at the same time:

ssize_t read(struct file *filp, char _ _user *buff,
    size_t count, loff_t *offp);
ssize_t write(struct file *filp, const char _ _user *buff,
    size_t count, loff_t *offp);

对于这两种方法,filp 是文件指针,count是请求的数据传输的大小。该buff参数指向保存要写入的数据的用户缓冲区或应放置新读取的数据的空缓冲区。最后,offp是一个指向“长偏移类型”对象的指针,该对象指示用户正在访问的文件位置。返回值是一个“有符号大小类型”;其用途稍后讨论。

For both methods, filp is the file pointer and count is the size of the requested data transfer. The buff argument points to the user buffer holding the data to be written or the empty buffer where the newly read data should be placed. Finally, offp is a pointer to a "long offset type" object that indicates the file position the user is accessing. The return value is a "signed size type"; its use is discussed later.

让我们重申,读取写入方法的buff参数是用户空间指针。因此,它不能被内核代码直接取消引用。造成此限制的原因有以下几个:

Let us repeat that the buff argument to the read and write methods is a user-space pointer. Therefore, it cannot be directly dereferenced by kernel code. There are a few reasons for this restriction:

  • 根据驱动程序运行的体系结构以及内核的配置方式,用户空间指针在内核模式下运行时可能根本无效。该地址可能没有映射,或者它可能指向一些其他随机数据。

  • Depending on which architecture your driver is running on, and how the kernel was configured, the user-space pointer may not be valid while running in kernel mode at all. There may be no mapping for that address, or it could point to some other, random data.

  • 即使该指针在内核空间中确实具有相同的含义,用户空间内存是分页的,并且在进行系统调用时,相关内存可能并不驻留在 RAM 中。尝试直接引用用户空间内存可能会产生页面错误,而这是内核代码不允许做的事情。结果将是一个“oops”,它会导致进行系统调用的进程死亡。

  • Even if the pointer does mean the same thing in kernel space, user-space memory is paged, and the memory in question might not be resident in RAM when the system call is made. Attempting to reference the user-space memory directly could generate a page fault, which is something that kernel code is not allowed to do. The result would be an "oops," which would result in the death of the process that made the system call.

  • 有问题的指针是由用户程序提供的,该程序可能有错误或恶意。如果您的驱动程序盲目地取消引用用户提供的指针,则它提供了一个开放的门户,允许用户空间程序访问或覆盖系统中任何位置的内存。如果您不希望对损害用户系统的安全性负责,则永远不能直接取消引用用户空间指针。

  • The pointer in question has been supplied by a user program, which could be buggy or malicious. If your driver ever blindly dereferences a user-supplied pointer, it provides an open doorway allowing a user-space program to access or overwrite memory anywhere in the system. If you do not wish to be responsible for compromising the security of your users' systems, you cannot ever dereference a user-space pointer directly.

显然,您的驱动程序必须能够访问用户空间缓冲区才能完成其工作。然而,为了安全起见,这种访问必须始终由内核提供的特殊函数执行。我们在这里介绍其中一些函数(在<asm/uaccess.h>中定义 ),其余的在第 6.1.4 节中介绍;他们使用一些特殊的、依赖于体系结构的魔法来确保内核和用户空间之间的数据传输以安全和正确的方式进行。

Obviously, your driver must be able to access the user-space buffer in order to get its job done. This access must always be performed by special, kernel-supplied functions, however, in order to be safe. We introduce some of those functions (which are defined in <asm/uaccess.h>) here, and the rest in the Section 6.1.4; they use some special, architecture-dependent magic to ensure that data transfers between kernel and user space happen in a safe and correct way.

scull读取写入代码需要将一整段数据复制到用户地址空间或从用户地址空间复制出来。此功能由以下内核函数提供,它们复制任意字节数组,并且位于大多数读取写入实现的核心:

The code for read and write in scull needs to copy a whole segment of data to or from the user address space. This capability is offered by the following kernel functions, which copy an arbitrary array of bytes and sit at the heart of most read and write implementations:

unsigned long copy_to_user(void _ _user *to, 
                           const void *from, 
                           unsigned long count);
unsigned long copy_from_user(void *to, 
                             const void _ _user *from, 
                             unsigned long count);

尽管这些函数的行为类似于普通的memcpy函数,但从内核代码访问用户空间时必须格外小心。正在寻址的用户页面当前可能不存在于内存中,并且虚拟内存子系统可以在页面传输到位时使进程进入睡眠状态。例如,当必须从交换空间检索页面时,就会发生这种情况。驱动程序编写者的最终结果是任何访问用户空间的函数都必须是可重入的,必须能够与其他驱动程序函数同时执行,特别是必须处于可以合法休眠的位置。我们将在第 5 章中回到这个主题。

Although these functions behave like normal memcpy functions, a little extra care must be used when accessing user space from kernel code. The user pages being addressed might not be currently present in memory, and the virtual memory subsystem can put the process to sleep while the page is being transferred into place. This happens, for example, when the page must be retrieved from swap space. The net result for the driver writer is that any function that accesses user space must be reentrant, must be able to execute concurrently with other driver functions, and, in particular, must be in a position where it can legally sleep. We return to this subject in Chapter 5.

这两个函数的作用并不仅限于将数据复制到用户空间或从用户空间复制数据:它们还会检查用户空间指针是否有效。如果指针无效,则不执行复制;另一方面,如果在复制过程中遇到无效地址,则仅复制部分数据。在这两种情况下,返回值都是仍需复制的内存量。scull代码会检查这个错误返回值,如果它不为 0,则向用户返回-EFAULT

The role of the two functions is not limited to copying data to and from user-space: they also check whether the user space pointer is valid. If the pointer is invalid, no copy is performed; if an invalid address is encountered during the copy, on the other hand, only part of the data is copied. In both cases, the return value is the amount of memory still to be copied. The scull code looks for this error return, and returns -EFAULT to the user if it's not 0.

用户空间访问和无效用户空间指针的主题有些高级,将在第 6 章中讨论。然而,值得注意的是,如果您不需要检查用户空间指针,您可以调用__copy_to_user__copy_from_user来代替。例如,如果您知道您已经检查了参数,那么这很有用。但要小心;事实上,如果您不检查传递给这些函数的用户空间指针,那么您可能会造成内核崩溃和/或安全漏洞。

The topic of user-space access and invalid user space pointers is somewhat advanced and is discussed in Chapter 6. However, it's worth noting that if you don't need to check the user-space pointer you can invoke _ _copy_to_user and _ _copy_from_user instead. This is useful, for example, if you know you already checked the argument. Be careful, however; if, in fact, you do not check a user-space pointer that you pass to these functions, then you can create kernel crashes and/or security holes.

就实际的设备方法而言, 读取方法的任务是将数据从设备复制到用户空间(使用 copy_to_user),而写入方法必须将数据从用户空间复制到设备(使用copy_from_user)。每个 读取写入系统调用都请求传输特定数量的字节,但驱动程序可以自由地传输较少的数据 - 读取和写入的确切规则略有不同,将在本章后面进行描述。

As far as the actual device methods are concerned, the task of the read method is to copy data from the device to user space (using copy_to_user), while the write method must copy data from user space to the device (using copy_from_user). Each read or write system call requests transfer of a specific number of bytes, but the driver is free to transfer less data—the exact rules are slightly different for reading and writing and are described later in this chapter.

无论这些方法传输多少数据,它们通常都应该在系统调用成功完成后更新*offp处的文件位置,以表示当前文件位置。然后,内核会在适当的时候将文件位置的更改传播回file结构。然而,preadpwrite系统调用具有不同的语义:它们从给定的文件偏移量开始操作,并且不会更改任何其他系统调用所看到的文件位置。这些调用传入一个指向用户提供的位置的指针,并丢弃驱动程序对它所做的更改。

Whatever the amount of data the methods transfer, they should generally update the file position at *offp to represent the current file position after successful completion of the system call. The kernel then propagates the file position change back into the file structure when appropriate. The pread and pwrite system calls have different semantics, however; they operate from a given file offset and do not change the file position as seen by any other system calls. These calls pass in a pointer to the user-supplied position, and discard the changes that your driver makes.
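The pread semantics can be observed from user space with an ordinary file; nothing here is scull-specific, and the file path used by the caller is arbitrary:

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Demo: pread reads at an explicit offset and does NOT move the fd's
 * file position, unlike read. Returns 0 if both behaviors are observed. */
int pread_leaves_offset(const char *path)
{
    char buf[6] = "";
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, "hello world", 11) != 11) {
        close(fd);
        return -1;
    }
    lseek(fd, 0, SEEK_SET);              /* position now 0 */

    pread(fd, buf, 5, 6);                /* reads "world" at offset 6 */
    long pos = lseek(fd, 0, SEEK_CUR);   /* still 0: pread didn't move it */
    close(fd);
    return (strncmp(buf, "world", 5) == 0 && pos == 0) ? 0 : 1;
}
```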

图 3-2展示了典型的 读取实现如何使用其参数。

Figure 3-2 represents how a typical read implementation uses its arguments.

read 的参数

图 3-2。read 的参数

Figure 3-2. The arguments to read

如果发生错误,读取 和写入方法都会返回负值。相反,大于或等于 0 的返回值告诉调用程序已成功传输了多少字节。如果某些数据传输正确,然后发生错误,则返回值必须是成功传输的字节数,并且直到下次调用该函数时才会报告错误。当然,实现此约定需要您的驱动程序记住已发生错误,以便将来可以返回错误状态。

Both the read and write methods return a negative value if an error occurs. A return value greater than or equal to 0, instead, tells the calling program how many bytes have been successfully transferred. If some data is transferred correctly and then an error happens, the return value must be the count of bytes successfully transferred, and the error does not get reported until the next time the function is called. Implementing this convention requires, of course, that your driver remember that the error has occurred so that it can return the error status in the future.

尽管内核函数返回一个负数来表示错误,并且该数字的值指示发生的错误类型(如第 2 章所述),但在用户空间中运行的程序始终将-1视为错误返回值。它们需要访问errno变量才能了解发生了什么。用户空间的行为由 POSIX 标准规定,但该标准并未对内核内部的运行方式提出要求。

Although kernel functions return a negative number to signal an error, and the value of the number indicates the kind of error that occurred (as introduced in Chapter 2), programs that run in user space always see -1 as the error return value. They need to access the errno variable to find out what happened. The user-space behavior is dictated by the POSIX standard, but that standard does not make requirements on how the kernel operates internally.
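The user-space side of this convention is easy to demonstrate: a failing system call returns -1, and the specific code shows up (made positive) in errno — here EBADF, since -1 is never a valid file descriptor:

```c
#include <errno.h>
#include <unistd.h>

/* A failing system call returns -1 to user space; the kernel's negative
 * return code is delivered through errno. Returns 0 when the convention
 * is observed. */
int read_error_reports_via_errno(void)
{
    char c;
    errno = 0;
    ssize_t n = read(-1, &c, 1);   /* invalid descriptor: the call must fail */
    return (n == -1 && errno == EBADF) ? 0 : 1;
}
```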

读取方法

The read Method

read的返回值由调用应用程序解释:

The return value for read is interpreted by the calling application program:

  • 如果该值等于传递给读取系统调用的count参数,则已传输了请求的字节数。这是最佳情况。

  • If the value equals the count argument passed to the read system call, the requested number of bytes has been transferred. This is the optimal case.

  • 如果该值为正数,但小于count,则仅传输了部分数据。发生这种情况的原因有多种,具体取决于设备。大多数情况下,应用程序会重试读取。例如,如果您使用fread 函数进行读取,则库函数会重新发出系统调用,直到完成请求的数据传输。

  • If the value is positive, but smaller than count, only part of the data has been transferred. This may happen for a number of reasons, depending on the device. Most often, the application program retries the read. For instance, if you read using the fread function, the library function reissues the system call until completion of the requested data transfer.

  • 如果值为0,则已到达文件末尾(并且未读取任何数据)。

  • If the value is 0, end-of-file was reached (and no data was read).

  • 负值意味着存在错误。该值根据<linux/errno.h>指定错误是什么。错误时返回的典型值包括-EINTR (中断的系统调用)或-EFAULT(错误的地址)。

  • A negative value means there was an error. The value specifies what the error was, according to <linux/errno.h>. Typical values returned on error include -EINTR (interrupted system call) or -EFAULT (bad address).
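The retry behavior that fread performs can be sketched as a user-space loop around read; this is how a caller copes with the partial-transfer rule above (a generic sketch, not library code):

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Keep issuing read until the request completes, end-of-file is reached,
 * or a real error occurs; a short read just means "call again". */
ssize_t read_full(int fd, void *buf, size_t count)
{
    size_t done = 0;
    while (done < count) {
        ssize_t n = read(fd, (char *)buf + done, count - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;            /* interrupted syscall: just retry */
            return -1;               /* genuine error */
        }
        if (n == 0)
            break;                   /* end-of-file */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```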

前面的列表中缺少的是“没有数据,但可能稍后到达”的情况。在这种情况下,读取系统调用应该阻塞。我们将在第 6 章中处理阻塞式输入。

What is missing from the preceding list is the case of "there is no data, but it may arrive later." In this case, the read system call should block. We'll deal with blocking input in Chapter 6.

scull代码利用了这些规则。特别是,它利用了部分读取规则。每次调用scull_read仅处理单个量子,而不实现循环来收集所有数据;这使得代码更短,也更容易阅读。如果读取程序确实需要更多数据,它会重复调用。如果使用标准 I/O 库(即fread)来读取设备,应用程序甚至不会注意到数据传输的量化。

The scull code takes advantage of these rules. In particular, it takes advantage of the partial-read rule. Each invocation of scull_read deals only with a single data quantum, without implementing a loop to gather all the data; this makes the code shorter and easier to read. If the reading program really wants more data, it reiterates the call. If the standard I/O library (i.e., fread) is used to read the device, the application won't even notice the quantization of the data transfer.

如果当前读取位置大于设备大小,则scull读取方法将返回 0,表示没有可用数据(换句话说,我们位于文件末尾)。如果进程 A 正在读取设备,而进程 B 打开设备进行写入,从而将设备截断为长度 0,就可能发生这种情况。进程 A 突然发现自己超出了文件末尾,下一次读取调用将返回 0。

If the current read position is greater than the device size, the read method of scull returns 0 to signal that there's no data available (in other words, we're at end-of-file). This situation can happen if process A is reading the device while process B opens it for writing, thus truncating the device to a length of 0. Process A suddenly finds itself past end-of-file, and the next read call returns 0.

下面是read的代码(暂时忽略对down_interruptibleup的调用;我们将在下一章中讨论它们):

Here is the code for read (ignore the calls to down_interruptible and up for now; we will get to them in the next chapter):

ssize_t scull_read(struct file *filp, char _ _user *buf, size_t count,
                loff_t *f_pos)
{
    struct scull_dev *dev = filp->private_data; 
    struct scull_qset *dptr;    /* the first listitem */
    int quantum, qset;
    int itemsize; /* how many bytes in the listitem */
    int item, s_pos, q_pos, rest;
    ssize_t retval = 0;

    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;
    quantum = dev->quantum;
    qset = dev->qset;
    itemsize = quantum*qset;
        
    if (*f_pos >= dev->size)
       goto out;
    if (*f_pos + count > dev->size)
       count = dev->size - *f_pos;

    /* find listitem, qset index, and offset in the quantum */
    item = (long)*f_pos / itemsize;
    rest = (long)*f_pos % itemsize;
    s_pos = rest / quantum; q_pos = rest % quantum;

    /* follow the list up to the right position (defined elsewhere) */
    dptr = scull_follow(dev, item);

    if (dptr == NULL || !dptr->data || !dptr->data[s_pos])
        goto out; /* don't fill holes */

    /* read only up to the end of this quantum */
    if (count > quantum - q_pos)
        count = quantum - q_pos;

    if (copy_to_user(buf, dptr->data[s_pos] + q_pos, count)) {
        retval = -EFAULT;
        goto out;
    }
    *f_pos += count;
    retval = count;

  out:
    up(&dev->sem);
    return retval;
}
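The index arithmetic in the middle of scull_read can be pulled out and checked on its own in user space. The helper below (a hypothetical name, introduced for illustration) mirrors the driver's computation:

```c
#include <assert.h>

/* Mirror of the position arithmetic in scull_read: given a file position,
 * find the list item, the index within the quantum set, and the offset
 * inside the quantum. Hypothetical user-space helper for checking only. */
struct scull_pos { long item, s_pos, q_pos; };

struct scull_pos scull_locate(long f_pos, long quantum, long qset)
{
    struct scull_pos p;
    long itemsize = quantum * qset;     /* bytes held by one list item */
    long rest = f_pos % itemsize;
    p.item  = f_pos / itemsize;         /* which list item */
    p.s_pos = rest / quantum;           /* which quantum inside the item */
    p.q_pos = rest % quantum;           /* offset inside that quantum */
    return p;
}
```

With the default sizes (quantum 4000, qset 1000), position 4,003,000 lands in the second list item, first quantum, offset 3000.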

写入方法

The write Method

writeread一样,可以传输比请求更少的数据,其返回值规则如下:

write, like read, can transfer less data than was requested, according to the following rules for the return value:

  • 如果该值等于count,则已传输请求的字节数。

  • If the value equals count, the requested number of bytes has been transferred.

  • 如果该值为正数,但小于count,则仅传输了部分数据。该程序很可能会重试写入其余数据。

  • If the value is positive, but smaller than count, only part of the data has been transferred. The program will most likely retry writing the rest of the data.

  • 如果值为0,则未写入任何内容。这个结果不是错误,没有理由返回错误代码。同样,标准库会重试对write的调用。我们将在第 6 章中研究这种情况的确切含义,那里会介绍阻塞式写入

  • If the value is 0, nothing was written. This result is not an error, and there is no reason to return an error code. Once again, the standard library retries the call to write. We'll examine the exact meaning of this case in Chapter 6, where blocking write is introduced.

  • 负值表示发生错误;至于read ,有效的错误值是<linux/errno.h>中定义的值。

  • A negative value means an error occurred; as for read, valid error values are those defined in <linux/errno.h>.

不幸的是,仍然可能存在行为不当的程序,它们会在执行部分传输时发出错误消息并中止。发生这种情况是因为一些程序员习惯于看到写入调用要么失败要么完全成功,这实际上是大多数时候发生的情况,并且设备也应该支持。scull实现中的这一限制可以修复,但我们不想让代码变得过于复杂。

Unfortunately, there may still be misbehaving programs that issue an error message and abort when a partial transfer is performed. This happens because some programmers are accustomed to seeing write calls that either fail or succeed completely, which is actually what happens most of the time and should be supported by devices as well. This limitation in the scull implementation could be fixed, but we didn't want to complicate the code more than necessary.
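A well-behaved caller treats a short write as a cue to retry, not as a failure. A user-space sketch of that loop, the counterpart of the read-side retry:

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Retry loop around write: keep going until the whole buffer is out,
 * restarting after interrupts; only a real error aborts the transfer. */
ssize_t write_full(int fd, const void *buf, size_t count)
{
    size_t done = 0;
    while (done < count) {
        ssize_t n = write(fd, (const char *)buf + done, count - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted: retry the call */
            return -1;              /* genuine error */
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```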

用于 write的scull代码一次处理一个量子,就像 read方法一样:

The scull code for write deals with a single quantum at a time, as the read method does:

ssize_t scull_write(struct file *filp, const char _ _user *buf, size_t count,
                loff_t *f_pos)
{
    struct scull_dev *dev = filp->private_data;
    struct scull_qset *dptr;
    int quantum = dev->quantum, qset = dev->qset;
    int itemsize = quantum * qset;
    int item, s_pos, q_pos, rest;
    ssize_t retval = -ENOMEM; /* value used in "goto out" statements */

    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;

    /* find listitem, qset index and offset in the quantum */
    item = (long)*f_pos / itemsize;
    rest = (long)*f_pos % itemsize;
    s_pos = rest / quantum; q_pos = rest % quantum;

    /* follow the list up to the right position */
    dptr = scull_follow(dev, item);
    if (dptr == NULL)
        goto out;
    if (!dptr->data) {
        dptr->data = kmalloc(qset * sizeof(char *), GFP_KERNEL);
        if (!dptr->data)
            goto out;
        memset(dptr->data, 0, qset * sizeof(char *));
    }
    if (!dptr->data[s_pos]) {
        dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL);
        if (!dptr->data[s_pos])
            goto out;
    }
    /* write only up to the end of this quantum */
    if (count > quantum - q_pos)
        count = quantum - q_pos;

    if (copy_from_user(dptr->data[s_pos]+q_pos, buf, count)) {
        retval = -EFAULT;
        goto out;
    }
    *f_pos += count;
    retval = count;

        /* update the size */
    if (dev->size < *f_pos)
        dev->size = *f_pos;

  out:
    up(&dev->sem);
    return retval;
}

readvwritev

readv and writev

Unix 系统很早就支持两个名为readvwritev系统调用。这些“向量”版本的读取写入接受一个结构数组,每个结构都包含一个指向缓冲区的指针和一个长度值。readv调用将依次把指示的数据量读入每个缓冲区。而writev则会将各个缓冲区的内容收集在一起,并将它们作为单个写入操作输出。

Unix systems have long supported two system calls named readv and writev. These "vector" versions of read and write take an array of structures, each of which contains a pointer to a buffer and a length value. A readv call would then be expected to read the indicated amount into each buffer in turn. writev, instead, would gather together the contents of each buffer and put them out as a single write operation.

如果您的驱动程序没有提供处理向量操作的方法,readvwritev就会通过多次调用读取写入方法来实现。然而,在许多情况下,直接实现readvwritev可以获得更高的效率。

If your driver does not supply methods to handle the vector operations, readv and writev are implemented with multiple calls to your read and write methods. In many situations, however, greater efficiency is achieved by implementing readv and writev directly.

向量运算的原型是:

The prototypes for the vector operations are:

ssize_t (*readv) (struct file *filp, const struct iovec *iov, 
                  unsigned long count, loff_t *ppos);
ssize_t (*writev) (struct file *filp, const struct iovec *iov, 
                  unsigned long count, loff_t *ppos);

这里,filpppos参数与读取写入的相同。iovec结构在<linux/uio.h>中定义,如下所示:

Here, the filp and ppos arguments are the same as for read and write. The iovec structure, defined in <linux/uio.h>, looks like:

struct iovec
{
    void _  _user *iov_base;
    _ _kernel_size_t iov_len;
};

每个iovec描述了要传输的一块数据:它从iov_base(位于用户空间)开始,长度为iov_len字节。count参数告诉该方法有多少个iovec结构。这些结构由应用程序创建,但内核会在调用驱动程序之前将它们复制到内核空间。

Each iovec describes one chunk of data to be transferred; it starts at iov_base (in user space) and is iov_len bytes long. The count parameter tells the method how many iovec structures there are. These structures are created by the application, but the kernel copies them into kernel space before calling the driver.

向量操作的最简单实现是一个简单的循环,只需将每个iovec中的地址和长度传递给驱动程序的读取写入函数。然而,高效且正确的行为通常需要驱动程序做一些更聪明的事情。例如,磁带驱动器上的writev应该将所有iovec结构的内容作为单条记录写入磁带。

The simplest implementation of the vectored operations would be a straightforward loop that just passes the address and length out of each iovec to the driver's read or write function. Often, however, efficient and correct behavior requires that the driver do something smarter. For example, a writev on a tape drive should write the contents of all the iovec structures as a single record on the tape.
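From user space, the gather behavior looks like this (an ordinary file at an arbitrary path; nothing scull-specific):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Demo of the iovec interface: writev gathers several buffers into a
 * single write operation. Returns 0 when the gathered data reads back. */
int writev_demo(const char *path)
{
    struct iovec iov[2];
    char part1[] = "hello ";
    char part2[] = "world";
    char back[12] = "";
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    iov[0].iov_base = part1;
    iov[0].iov_len  = 6;
    iov[1].iov_base = part2;
    iov[1].iov_len  = 5;
    ssize_t n = writev(fd, iov, 2);      /* one call, both buffers */

    ssize_t m = pread(fd, back, 11, 0);  /* read the result back */
    close(fd);
    return (n == 11 && m == 11 && strncmp(back, "hello world", 11) == 0) ? 0 : 1;
}
```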

然而,许多驱动程序自己实现这些方法并不会获得任何好处。因此,scull省略了它们。内核会用读取写入来模拟它们,最终结果是相同的。

Many drivers, however, gain no benefit from implementing these methods themselves. Therefore, scull omits them. The kernel emulates them with read and write, and the end result is the same.

使用新设备

Playing with the New Devices

一旦实现了上述四个方法,就可以编译并测试驱动程序了;它会保留您写入的任何数据,直到您用新数据覆盖它为止。该设备就像一个数据缓冲区,其长度仅受可用实际 RAM 数量的限制。您可以尝试使用cpdd和输入/输出重定向来测试驱动程序。

Once you are equipped with the four methods just described, the driver can be compiled and tested; it retains any data you write to it until you overwrite it with new data. The device acts like a data buffer whose length is limited only by the amount of real RAM available. You can try using cp, dd, and input/output redirection to test out the driver.

free命令可用于查看空闲内存量如何根据写入scull的数据量而收缩和扩展。

The free command can be used to see how the amount of free memory shrinks and expands according to how much data is written into scull.

为了对一次读写一个量子的行为更有信心,您可以在驱动程序中的适当位置添加printk,并观察应用程序读取或写入大块数据时会发生什么。或者,使用strace实用程序监视程序发出的系统调用及其返回值。跟踪cpls -l > /dev/scull0会显示量化的读取和写入。监控(和调试)技术将在第 4 章中详细介绍。

To get more confident with reading and writing one quantum at a time, you can add a printk at an appropriate point in the driver and watch what happens while an application reads or writes large chunks of data. Alternatively, use the strace utility to monitor the system calls issued by a program, together with their return values. Tracing a cp or an ls -l > /dev/scull0 shows quantized reads and writes. Monitoring (and debugging) techniques are presented in detail in Chapter 4.

快速参考

Quick Reference

本章介绍了以下符号和头文件。struct file_operationsstruct file中的字段列表这里不再重复。

This chapter introduced the following symbols and header files. The list of the fields in struct file_operations and struct file is not repeated here.

#include <linux/types.h>

dev_t

dev_t是用于表示内核中设备号的类型。

dev_t is the type used to represent device numbers within the kernel.

int MAJOR(dev_t dev);

int MINOR(dev_t dev);

从设备编号中提取主要编号和次要编号的宏。

Macros that extract the major and minor numbers from a device number.

dev_t MKDEV(unsigned int major, unsigned int minor);

dev_t从主要数字和次要数字构建数据项的宏。

Macro that builds a dev_t data item from the major and minor numbers.

#include <linux/fs.h>

“filesystem”头是编写设备驱动程序所需的头。许多重要的函数和数据结构都在这里声明。

The "filesystem" header is the header required for writing device drivers. Many important functions and data structures are declared in here.

int register_chrdev_region(dev_t first, unsigned int count, char *name);

int alloc_chrdev_region(dev_t *dev, unsigned int firstminor, unsigned int count, char *name);

void unregister_chrdev_region(dev_t first, unsigned int count);

允许驱动程序分配和释放设备编号范围的函数。当预先知道所需的主设备号时,应使用register_chrdev_region;对于动态分配,请改用alloc_chrdev_region

Functions that allow a driver to allocate and free ranges of device numbers. register_chrdev_region should be used when the desired major number is known in advance; for dynamic allocation, use alloc_chrdev_region instead.

int register_chrdev(unsigned int major, const char *name, struct file_operations *fops);

旧的(2.6 之前的)字符设备注册例程。它在 2.6 内核中被模拟,但不应用于新代码。如果主设备号不为0,则原样使用;否则会为此设备分配一个动态编号。

The old (pre-2.6) char device registration routine. It is emulated in the 2.6 kernel but should not be used for new code. If the major number is not 0, it is used unchanged; otherwise a dynamic number is assigned for this device.

int unregister_chrdev(unsigned int major, const char *name);

撤销使用register_chrdev所做注册的函数。majorname字符串都必须与注册驱动程序时使用的值相同。

Function that undoes a registration made with register_chrdev. Both major and the name string must contain the same values that were used to register the driver.

struct file_operations;

struct file;

struct inode;

大多数设备驱动程序使用的三种重要数据结构。该file_operations结构包含字符驱动程序的方法; struct file代表一个打开的文件,struct inode代表磁盘上的一个文件。

Three important data structures used by most device drivers. The file_operations structure holds a char driver's methods; struct file represents an open file, and struct inode represents a file on disk.

#include <linux/cdev.h>

struct cdev *cdev_alloc(void);

void cdev_init(struct cdev *dev, struct file_operations *fops);

int cdev_add(struct cdev *dev, dev_t num, unsigned int count);

void cdev_del(struct cdev *dev);

用于管理cdev 结构的函数,这些结构代表内核中的字符设备。

Functions for the management of cdev structures, which represent char devices within the kernel.

#include <linux/kernel.h>

container_of(pointer, type, field);
#include <linux/kernel.h>

container_of(pointer, type, field);

一个方便的宏,可以从指向嵌入在某结构内部的成员结构的指针,获得指向该外层结构本身的指针。

A convenience macro that may be used to obtain a pointer to a structure from a pointer to some other structure contained within it.

#include <asm/uaccess.h>
#include <asm/uaccess.h>

该包含文件声明了内核代码用于将数据移入和移出用户空间的函数。

This include file declares functions used by kernel code to move data to and from user space.

unsigned long copy_from_user (void *to, const void *from, unsigned long

count);

unsigned long copy_to_user (void *to, const void *from, unsigned long count);
unsigned long copy_from_user (void *to, const void *from, unsigned long

count);

unsigned long copy_to_user (void *to, const void *from, unsigned long count);

在用户空间和内核空间之间复制数据。

Copy data between user space and kernel space.




[1] 通常可以从 sysfs 获得更好的设备信息;在基于 2.6 的系统上,sysfs 一般挂载在 /sys 上。然而,让 scull 通过 sysfs 导出信息超出了本章的范围;我们将在第 14 章中回到这个主题。

[1] Even better device information can usually be obtained from sysfs, generally mounted on /sys on 2.6-based systems. Getting scull to export information via sysfs is beyond the scope of this chapter, however; we'll return to this topic in Chapter 14.

[ 2 ] Linux Standard Base 指定 init 脚本应放置在/etc/init.d中,但某些发行版仍然将它们放置在其他位置。此外,如果您的脚本要在引导时运行,您需要从适当的运行级别目录(即.../rc3.d)建立到它的链接。

[2] The Linux Standard Base specifies that init scripts should be placed in /etc/init.d, but some distributions still place them elsewhere. In addition, if your script is to be run at boot time, you need to make a link to it from the appropriate run-level directory (i.e., .../rc3.d).

[ 3 ]尽管某些内核开发人员威胁说将来会这样做。

[3] Though certain kernel developers have threatened to do exactly that in the future.

[ 4 ] init 脚本scull.init不接受命令行上的驱动程序选项,但它支持配置文件,因为它设计为在启动和关闭时自动使用。

[4] The init script scull.init doesn't accept driver options on the command line, but it supports a configuration file, because it's designed for automatic use at boot and shutdown time.

[5] 请注意,并不是每次进程调用 close 时都会调用 release。每当 file 结构被共享时(例如,在 fork 或 dup 之后),在所有副本都关闭之前不会调用 release。如果您需要在任何一个副本关闭时刷新挂起的数据,您应该实现 flush 方法。

[5] Note that release isn't invoked every time a process calls close. Whenever a file structure is shared (for example, after a fork or a dup), release won't be invoked until all copies are closed. If you need to flush pending data when any copy is closed, you should implement the flush method.

[ 6 ]有一种较旧的机制可以避免使用cdev结构(我们将在第 3.4.2 节中讨论)。然而,新代码应该使用更新的技术。

[6] There is an older mechanism that avoids the use of cdev structures (which we discuss in Section 3.4.2). New code should use the newer technique, however.

[7] 该设备的其他变体由不同的函数关闭,因为 scull_open 为每种变体替换了不同的 filp->f_op。我们将在介绍每种变体时讨论这些函数。

[7] The other flavors of the device are closed by different functions because scull_open substituted a different filp->f_op for each device. We'll discuss these as we introduce each flavor.

第 4 章调试技术

Chapter 4. Debugging Techniques

内核编程带来了其独特的调试挑战。内核代码不能在调试器下轻松执行,也不能轻松跟踪,因为它是一组与特定进程无关的功能。内核代码错误也可能非常难以重现,并且可能导致整个系统瘫痪,从而破坏了许多可用于追踪它们的证据。

Kernel programming brings its own, unique debugging challenges. Kernel code cannot be easily executed under a debugger, nor can it be easily traced, because it is a set of functionalities not related to a specific process. Kernel code errors can also be exceedingly hard to reproduce and can bring down the entire system with them, thus destroying much of the evidence that could be used to track them down.

本章介绍了在这种困难的情况下可以用来监视内核代码和跟踪错误的技术。

This chapter introduces techniques you can use to monitor kernel code and trace errors under such trying circumstances.

内核中的调试支持

Debugging Support in the Kernel

在第 2 章中,我们建议您构建并安装自己的内核,而不是运行发行版附带的现成内核。运行自己的内核的最重要原因之一是,内核开发人员已在内核本身中内置了多项调试功能。这些功能会产生额外的输出并降低性能,因此发行商的生产内核中往往不会启用它们。然而,作为内核开发人员,您的优先级不同,会乐于接受额外内核调试支持带来的(微小)开销。

In Chapter 2, we recommended that you build and install your own kernel, rather than running the stock kernel that comes with your distribution. One of the strongest reasons for running your own kernel is that the kernel developers have built several debugging features into the kernel itself. These features can create extra output and slow performance, so they tend not to be enabled in production kernels from distributors. As a kernel developer, however, you have different priorities and will gladly accept the (minimal) overhead of the extra kernel debugging support.

在这里,我们列出了用于开发的内核应该启用的配置选项。除非另有说明,所有这些选项都可以在您所用的任何内核配置工具的"内核黑客"(kernel hacking)菜单下找到。请注意,其中某些选项并非所有体系结构都支持。

Here, we list the configuration options that should be enabled for kernels used for development. Except where specified otherwise, all of these options are found under the "kernel hacking" menu in whatever kernel configuration tool you prefer. Note that some of these options are not supported by all architectures.

CONFIG_DEBUG_KERNEL
CONFIG_DEBUG_KERNEL

该选项只是使其他调试选项可用;它应该被打开,但本身并不启用任何功能。

This option just makes other debugging options available; it should be turned on but does not, by itself, enable any features.

CONFIG_DEBUG_SLAB
CONFIG_DEBUG_SLAB

这个关键选项会在内核内存分配函数中开启多种类型的检查;启用这些检查后,可以检测到许多内存越界和缺少初始化的错误。分配的内存的每个字节在交给调用者之前都被设置为 0xa5,释放时则被设置为 0x6b。如果您在驱动程序的输出中(或者经常在 oops 列表中)看到这两种"毒剂"模式中的任何一种反复出现,您就会确切地知道要找哪种错误。启用调试时,内核还会在每个已分配内存对象的前后放置特殊的保护值;如果这些值被改动,内核就知道有人越过了内存分配的边界,并会大声抱怨。针对更隐蔽错误的各种检查也会被启用。

This crucial option turns on several types of checks in the kernel memory allocation functions; with these checks enabled, it is possible to detect a number of memory overrun and missing initialization errors. Each byte of allocated memory is set to 0xa5 before being handed to the caller and then set to 0x6b when it is freed. If you ever see either of those "poison" patterns repeating in output from your driver (or often in an oops listing), you'll know exactly what sort of error to look for. When debugging is enabled, the kernel also places special guard values before and after every allocated memory object; if those values ever get changed, the kernel knows that somebody has overrun a memory allocation, and it complains loudly. Various checks for more obscure errors are enabled as well.

CONFIG_DEBUG_PAGEALLOC
CONFIG_DEBUG_PAGEALLOC

释放时,整页会从内核地址空间中移除。此选项会显著降低速度,但也能快速暴露某些类型的内存损坏错误。

Full pages are removed from the kernel address space when freed. This option can slow things down significantly, but it can also quickly point out certain kinds of memory corruption errors.

CONFIG_DEBUG_SPINLOCK
CONFIG_DEBUG_SPINLOCK

启用此选项后,内核会捕获对未初始化的自旋锁的操作以及各种其他错误(例如两次解锁锁)。

With this option enabled, the kernel catches operations on uninitialized spinlocks and various other errors (such as unlocking a lock twice).

CONFIG_DEBUG_SPINLOCK_SLEEP
CONFIG_DEBUG_SPINLOCK_SLEEP

此选项启用对持有自旋锁时尝试休眠的检查。实际上,只要您调用了一个可能休眠的函数,即使该调用实际上不会休眠,它也会发出抱怨。

This option enables a check for attempts to sleep while holding a spinlock. In fact, it complains if you call a function that could potentially sleep, even if the call in question would not sleep.

CONFIG_INIT_DEBUG
CONFIG_INIT_DEBUG

标有 __init(或 __initdata)的项目会在系统初始化或模块加载之后被丢弃。此选项启用检查,以捕捉在初始化完成后仍试图访问初始化期内存的代码。

Items marked with _ _init (or _ _initdata) are discarded after system initialization or module load time. This option enables checks for code that attempts to access initialization-time memory after initialization is complete.

CONFIG_DEBUG_INFO
CONFIG_DEBUG_INFO

此选项使构建的内核包含完整的调试信息。如果您想用 gdb 调试内核,就需要这些信息。如果您打算使用 gdb,可能还需要启用 CONFIG_FRAME_POINTER。

This option causes the kernel to be built with full debugging information included. You'll need that information if you want to debug the kernel with gdb. You may also want to enable CONFIG_FRAME_POINTER if you plan to use gdb.

CONFIG_MAGIC_SYSRQ
CONFIG_MAGIC_SYSRQ

启用“神奇 SysRq”键。我们将在本章后面的4.5.2 节中查看这个键。

Enables the "magic SysRq" key. We look at this key in Section 4.5.2 later in this chapter.

CONFIG_DEBUG_STACKOVERFLOW

CONFIG_DEBUG_STACK_USAGE
CONFIG_DEBUG_STACKOVERFLOW

CONFIG_DEBUG_STACK_USAGE

这些选项可以帮助跟踪内核堆栈溢出。堆栈溢出的一个明确标志是没有任何合理回溯的 oops 列表。第一个选项向内核添加显式溢出检查;第二个使内核监视堆栈使用情况并通过神奇的 SysRq 键提供一些统计信息。

These options can help track down kernel stack overflows. A sure sign of a stack overflow is an oops listing without any sort of reasonable back trace. The first option adds explicit overflow checks to the kernel; the second causes the kernel to monitor stack usage and make some statistics available via the magic SysRq key.

CONFIG_KALLSYMS
CONFIG_KALLSYMS

此选项(在“常规设置/标准功能”下)会导致内核符号信息内置到内核中;默认情况下它是启用的。符号信息用于调试上下文;如果没有它,oops 列表只能为您提供十六进制的内核回溯,这不是很有用。

This option (under "General setup/Standard features") causes kernel symbol information to be built into the kernel; it is enabled by default. The symbol information is used in debugging contexts; without it, an oops listing can give you a kernel traceback only in hexadecimal, which is not very useful.

CONFIG_IKCONFIG

CONFIG_IKCONFIG_PROC
CONFIG_IKCONFIG

CONFIG_IKCONFIG_PROC

这些选项(在“常规设置”菜单中找到)使完整的内核配置状态内置到内核中,并通过/proc提供。大多数内核开发人员知道他们使用的是哪种配置,并且不需要这些选项(这会使内核更大)。不过,如果您尝试调试其他人构建的内核中的问题,它们可能会很有用。

These options (found in the "General setup" menu) cause the full kernel configuration state to be built into the kernel and to be made available via /proc. Most kernel developers know which configuration they used and do not need these options (which make the kernel bigger). They can be useful, though, if you are trying to debug a problem in a kernel built by somebody else.

CONFIG_ACPI_DEBUG
CONFIG_ACPI_DEBUG

在“电源管理/ACPI”下。此选项打开详细的 ACPI(高级配置和电源接口)调试信息,如果您怀疑与 ACPI 相关的问题,这会很有用。

Under "Power management/ACPI." This option turns on verbose ACPI (Advanced Configuration and Power Interface) debugging information, which can be useful if you suspect a problem related to ACPI.

CONFIG_DEBUG_DRIVER
CONFIG_DEBUG_DRIVER

在“设备驱动程序”下。打开驱动程序核心中的调试信息,这对于跟踪低级支持代码中的问题非常有用。我们将在第 14 章中讨论驱动程序核心。

Under "Device drivers." Turns on debugging information in the driver core, which can be useful for tracking down problems in the low-level support code. We'll look at the driver core in Chapter 14.

CONFIG_SCSI_CONSTANTS
CONFIG_SCSI_CONSTANTS

该选项位于“设备驱动程序/SCSI 设备支持”下,内置了详细 SCSI 错误消息的信息。如果您正在使用 SCSI 驱动程序,您可能需要此选项。

This option, found under "Device drivers/SCSI device support," builds in information for verbose SCSI error messages. If you are working on a SCSI driver, you probably want this option.

CONFIG_INPUT_EVBUG
CONFIG_INPUT_EVBUG

此选项(在“设备驱动程序/输入设备支持”下)打开输入事件的详细日志记录。如果您正在开发输入设备的驱动程序,此选项可能会有所帮助。但是,请注意此选项的安全隐患:它会记录您键入的所有内容,包括您的密码。

This option (under "Device drivers/Input device support") turns on verbose logging of input events. If you are working on a driver for an input device, this option may be helpful. Be aware of the security implications of this option, however: it logs everything you type, including your passwords.

CONFIG_PROFILING
CONFIG_PROFILING

该选项位于“分析支持”下。分析通常用于系统性能调整,但它也可用于跟踪某些内核挂起和相关问题。

This option is found under "Profiling support." Profiling is normally used for system performance tuning, but it can also be useful for tracking down some kernel hangs and related problems.

当我们研究追踪内核问题的各种方法时,将会重新审视上述的一些选项。但首先,我们来看看经典的调试技术:打印语句。

We will revisit some of the above options as we look at various ways of tracking down kernel problems. But first, we will look at the classic debugging technique: print statements.

打印调试

Debugging by Printing

最常见的调试技术是监控;在应用程序编程中,这是通过在适当的位置调用 printf 来完成的。调试内核代码时,您可以用 printk 实现相同的目标。

The most common debugging technique is monitoring, which in applications programming is done by calling printf at suitable points. When you are debugging kernel code, you can accomplish the same goal with printk.

printk

printk

我们在前面的章节中使用了 printk 函数,并简单地假设它的工作方式与 printf 类似。现在是时候介绍其中的一些差异了。

We used the printk function in earlier chapters with the simplifying assumption that it works like printf. Now it's time to introduce some of the differences.

区别之一是,printk 允许您通过为消息关联不同的日志级别(loglevel,也即优先级),按严重性对消息进行分类。您通常使用宏来指示日志级别。例如,我们曾看到 KERN_INFO 被加在一些早先的打印语句前面,它就是消息可用的日志级别之一。日志级别宏展开为一个字符串,在编译时与消息文本连接在一起;这就是下面示例中优先级与格式字符串之间没有逗号的原因。下面是两个 printk 命令的例子,一条调试消息和一条关键消息:

One of the differences is that printk lets you classify messages according to their severity by associating different loglevels , or priorities, with the messages. You usually indicate the loglevel with a macro. For example, KERN_INFO, which we saw prepended to some of the earlier print statements, is one of the possible loglevels of the message. The loglevel macro expands to a string, which is concatenated to the message text at compile time; that's why there is no comma between the priority and the format string in the following examples. Here are two examples of printk commands, a debug message and a critical message:

printk(KERN_DEBUG "我在这里:%s:%i\n", __FILE__, __LINE__);
printk(KERN_CRIT "我被毁了;放弃 %p\n", ptr);
printk(KERN_DEBUG "Here I am: %s:%i\n", __FILE__, __LINE__);
printk(KERN_CRIT "I'm trashed; giving up on %p\n", ptr);

有八种可能的日志级别字符串,在头文件<linux/kernel.h>中定义;我们按照严重程度递减的顺序列出它们:

There are eight possible loglevel strings, defined in the header <linux/kernel.h>; we list them in order of decreasing severity:

KERN_EMERG
KERN_EMERG

用于紧急消息,通常是崩溃前的消息。

Used for emergency messages, usually those that precede a crash.

KERN_ALERT
KERN_ALERT

需要立即采取行动的情况。

A situation requiring immediate action.

KERN_CRIT
KERN_CRIT

关键情况,通常与严重的硬件或软件故障有关。

Critical conditions, often related to serious hardware or software failures.

KERN_ERR
KERN_ERR

用于报告错误情况;设备驱动程序经常用于KERN_ERR报告硬件问题。

Used to report error conditions; device drivers often use KERN_ERR to report hardware difficulties.

KERN_WARNING
KERN_WARNING

有关问题情况的警告,这些情况本身不会对系统造成严重问题。

Warnings about problematic situations that do not, in themselves, create serious problems with the system.

KERN_NOTICE
KERN_NOTICE

这些情况很正常,但仍然值得注意。许多与安全相关的情况都在此级别报告。

Situations that are normal, but still worthy of note. A number of security-related conditions are reported at this level.

KERN_INFO
KERN_INFO

信息性消息。许多驱动程序在启动时在此级别打印有关它们找到的硬件的信息。

Informational messages. Many drivers print information about the hardware they find at startup time at this level.

KERN_DEBUG
KERN_DEBUG

用于调试消息。

Used for debugging messages.

每个字符串(在宏扩展中)代表尖括号中的一个整数。整数范围为 0 到 7,值越小代表优先级越高。

Each string (in the macro expansion) represents an integer in angle brackets. Integers range from 0 to 7, with smaller values representing higher priorities.

未指定优先级的 printk 语句默认使用 DEFAULT_MESSAGE_LOGLEVEL,它在 kernel/printk.c 中被指定为一个整数。在 2.6.10 内核中,DEFAULT_MESSAGE_LOGLEVEL 是 KERN_WARNING,但它过去曾经变动过。

A printk statement with no specified priority defaults to DEFAULT_MESSAGE_LOGLEVEL, specified in kernel/printk.c as an integer. In the 2.6.10 kernel, DEFAULT_MESSAGE_LOGLEVEL is KERN_WARNING, but that has been known to change in the past.

根据日志级别,内核可能会把消息打印到当前控制台,它可以是文本模式终端、串行端口或并行打印机。如果优先级小于整数变量 console_loglevel,消息就会一次一行地送到控制台(除非提供了结尾的换行符,否则什么也不会发送)。如果系统上同时运行着 klogd 和 syslogd,内核消息会被追加到 /var/log/messages(或按照您的 syslogd 配置另行处理),与 console_loglevel 无关。如果 klogd 没有运行,消息不会到达用户空间,除非您读取 /proc/kmsg(这通常用 dmesg 命令最容易做到)。使用 klogd 时应记住,它不会保存连续的相同行;它只保存第一行,并在稍后记录收到的重复次数。

Based on the loglevel, the kernel may print the message to the current console, be it a text-mode terminal, a serial port, or a parallel printer. If the priority is less than the integer variable console_loglevel, the message is delivered to the console one line at a time (nothing is sent unless a trailing newline is provided). If both klogd and syslogd are running on the system, kernel messages are appended to /var/log/messages (or otherwise treated depending on your syslogd configuration), independent of console_loglevel. If klogd is not running, the message won't reach user space unless you read /proc/kmsg (which is often most easily done with the dmesg command). When using klogd, you should remember that it doesn't save consecutive identical lines; it only saves the first such line and, at a later time, the number of repetitions it received.

变量 console_loglevel 被初始化为 DEFAULT_CONSOLE_LOGLEVEL,并且可以通过 sys_syslog 系统调用修改。如 klogd 手册页所述,更改它的一种方法是在调用 klogd 时指定 -c 开关。请注意,要更改当前值,必须先杀掉 klogd,再用 -c 选项重新启动它。或者,您也可以编写一个程序来更改控制台日志级别。在 O'Reilly FTP 站点提供的源文件中,misc-progs/setlevel.c 就是这样一个程序。新级别用 1 到 8 之间(含两端)的整数值指定。如果设置为 1,则只有级别 0(KERN_EMERG)的消息才能到达控制台;如果设置为 8,则包括调试消息在内的所有消息都会显示。

The variable console_loglevel is initialized to DEFAULT_CONSOLE_LOGLEVEL and can be modified through the sys_syslog system call. One way to change it is by specifying the -c switch when invoking klogd, as specified in the klogd manpage. Note that to change the current value, you must first kill klogd and then restart it with the -c option. Alternatively, you can write a program to change the console loglevel. You'll find a version of such a program in misc-progs/setlevel.c in the source files provided on O'Reilly's FTP site. The new level is specified as an integer value between 1 and 8, inclusive. If it is set to 1, only messages of level 0 (KERN_EMERG) reach the console; if it is set to 8, all messages, including debugging ones, are displayed.

还可以使用文本文件/proc/sys/kernel/printk读取和修改控制台日志级别。该文件包含四个整数值:当前日志级别、缺少显式日志级别的消息的默认级别、允许的最小日志级别以及启动时默认日志级别。向此文件写入单个值会将当前日志级别更改为该值;因此,例如,您只需输入以下内容即可使所有内核消息显示在控制台上:

It is also possible to read and modify the console loglevel using the text file /proc/sys/kernel/printk. The file hosts four integer values: the current loglevel, the default level for messages that lack an explicit loglevel, the minimum allowed loglevel, and the boot-time default loglevel. Writing a single value to this file changes the current loglevel to that value; thus, for example, you can cause all kernel messages to appear at the console by simply entering:

# echo 8 > /proc/sys/kernel/printk
 # echo 8 > /proc/sys/kernel/printk

现在应该很清楚为什么 hello.c 示例使用了 KERN_ALERT 标记;它们是为了确保消息出现在控制台上。

It should now be apparent why the hello.c sample had the KERN_ALERT markers; they are there to make sure that the messages appear on the console.

重定向控制台消息

Redirecting Console Messages

Linux 允许您将消息发送到特定的虚拟控制台(如果您的控制台位于文本屏幕上),从而使控制台日志记录策略具有一定的灵活性。默认情况下,"控制台"是当前的虚拟终端。要选择不同的虚拟终端来接收消息,您可以对任何控制台设备发出 ioctl(TIOCLINUX)。下面的程序 setconsole 可用来选择哪个控制台接收内核消息;它必须由超级用户运行,可以在 misc-progs 目录中找到。

Linux allows for some flexibility in console logging policies by letting you send messages to a specific virtual console (if your console lives on the text screen). By default, the "console" is the current virtual terminal. To select a different virtual terminal to receive messages, you can issue ioctl(TIOCLINUX) on any console device. The following program, setconsole , can be used to choose which console receives kernel messages; it must be run by the superuser and is available in the misc-progs directory.

以下是整个程序。您应该使用单个参数来调用它,指定要接收消息的控制台的编号。

The following is the program in its entirety. You should invoke it with a single argument specifying the number of the console that is to receive messages.

int main(int argc, char **argv)
{
    char bytes[2] = {11,0}; /* 11 是 TIOCLINUX cmd 号 */

    if (argc == 2) bytes[1] = atoi(argv[1]); /* 选择的控制台 */
    else {
        fprintf(stderr, "%s: 需要一个参数\n",argv[0]); exit(1);
    }
    if (ioctl(STDIN_FILENO, TIOCLINUX, bytes)<0) {    /* 使用标准输入 */
        fprintf(stderr,"%s: ioctl(stdin, TIOCLINUX): %s\n",
                argv[0], strerror(errno));
        exit(1);
    }
    exit(0);
}
int main(int argc, char **argv)
{
    char bytes[2] = {11,0}; /* 11 is the TIOCLINUX cmd number */

    if (argc == 2) bytes[1] = atoi(argv[1]); /* the chosen console */
    else {
        fprintf(stderr, "%s: need a single arg\n",argv[0]); exit(1);
    }
    if (ioctl(STDIN_FILENO, TIOCLINUX, bytes)<0) {    /* use stdin */
        fprintf(stderr,"%s: ioctl(stdin, TIOCLINUX): %s\n",
                argv[0], strerror(errno));
        exit(1);
    }
    exit(0);
}

setconsole 使用特殊的 ioctl 命令 TIOCLINUX,它实现了 Linux 特有的功能。使用 TIOCLINUX 时,您要传给它一个指向字节数组的指针作为参数。数组的第一个字节是指定所请求子命令的编号,后面的字节随子命令而异。在 setconsole 中使用的是子命令 11,下一个字节(存放在 bytes[1] 中)标识虚拟控制台。TIOCLINUX 的完整描述可以在内核源代码的 drivers/char/tty_io.c 中找到。

setconsole uses the special ioctl command TIOCLINUX, which implements Linux-specific functions. To use TIOCLINUX, you pass it an argument that is a pointer to a byte array. The first byte of the array is a number that specifies the requested subcommand, and the following bytes are subcommand specific. In setconsole, subcommand 11 is used, and the next byte (stored in bytes[1]) identifies the virtual console. The complete description of TIOCLINUX can be found in drivers/char/tty_io.c, in the kernel sources.

消息如何记录

How Messages Get Logged

printk 函数将消息写入一个 __LOG_BUF_LEN 字节长的循环缓冲区,其大小是在配置内核时选定的,介于 4 KB 到 1 MB 之间。随后,该函数会唤醒所有正在等待消息的进程,也就是在 syslog 系统调用中休眠或正在读取 /proc/kmsg 的进程。日志引擎的这两个接口几乎等价,但要注意,从 /proc/kmsg 读取会消耗日志缓冲区中的数据,而 syslog 系统调用可以在返回日志数据的同时把数据留给其他进程。一般来说,读取 /proc 文件更容易,这也是 klogd 的默认行为。dmesg 命令可用于查看缓冲区的内容而不清空它;实际上,该命令会把缓冲区的全部内容送到标准输出,无论其是否已被读取过。

The printk function writes messages into a circular buffer that is _ _LOG_BUF_LEN bytes long: a value from 4 KB to 1 MB chosen while configuring the kernel. The function then wakes any process that is waiting for messages, that is, any process that is sleeping in the syslog system call or that is reading /proc/kmsg. These two interfaces to the logging engine are almost equivalent, but note that reading from /proc/kmsg consumes the data from the log buffer, whereas the syslog system call can optionally return log data while leaving it for other processes as well. In general, reading the /proc file is easier and is the default behavior for klogd. The dmesg command can be used to look at the content of the buffer without flushing it; actually, the command returns to stdout the whole content of the buffer, whether or not it has already been read.

如果您在停止 klogd 之后碰巧手动读取内核消息,您会发现 /proc 文件表现得像一个 FIFO:读取者会阻塞以等待更多数据。显然,如果 klogd 或别的进程已经在读取同样的数据,您就无法以这种方式读取消息,因为您会与它争夺数据。

If you happen to read the kernel messages by hand, after stopping klogd, you'll find that the /proc file looks like a FIFO, in that the reader blocks, waiting for more data. Obviously, you can't read messages this way if klogd or another process is already reading the same data, because you'll contend for it.

如果循环缓冲区已满,printk会回绕并开始将新数据添加到缓冲区的开头,覆盖最旧的数据。因此,记录过程会丢失最旧的数据。与使用这种循环缓冲区的优点相比,这个问题可以忽略不计。例如,循环缓冲区允许系统即使没有日志记录进程也可以运行,同时通过覆盖旧数据(如果没有人读取旧数据)来最大限度地减少内存浪费。Linux 消息传递方法的另一个特点是 printk可以从任何地方调用,甚至可以从中断处理程序调用,并且对可以打印的数据量没有限制。唯一的缺点是可能会丢失一些数据。

If the circular buffer fills up, printk wraps around and starts adding new data to the beginning of the buffer, overwriting the oldest data. Therefore, the logging process loses the oldest data. This problem is negligible compared with the advantages of using such a circular buffer. For example, a circular buffer allows the system to run even without a logging process, while minimizing memory waste by overwriting old data should nobody read it. Another feature of the Linux approach to messaging is that printk can be invoked from anywhere, even from an interrupt handler, with no limit on how much data can be printed. The only disadvantage is the possibility of losing some data.

如果 klogd 进程正在运行,它会检索内核消息并把它们分派给 syslogd,后者再检查 /etc/syslog.conf 以确定如何处理它们。syslogd 根据设施(facility)和优先级来区分消息;两者的允许取值都定义在 <sys/syslog.h> 中。内核消息由 LOG_KERN 设施记录,优先级与 printk 中所用的优先级相对应(例如,KERN_ERR 消息使用 LOG_ERR)。如果 klogd 没有运行,数据会保留在循环缓冲区中,直到有人读取它或缓冲区溢出。

If the klogd process is running, it retrieves kernel messages and dispatches them to syslogd, which in turn checks /etc/syslog.conf to find out how to deal with them. syslogd differentiates between messages according to a facility and a priority; allowable values for both the facility and the priority are defined in <sys/syslog.h>. Kernel messages are logged by the LOG_KERN facility at a priority corresponding to the one used in printk (for example, LOG_ERR is used for KERN_ERR messages). If klogd isn't running, data remains in the circular buffer until someone reads it or the buffer overflows.

如果您想避免驱动程序的监控消息弄乱系统日志,您可以为 klogd 指定 -f(文件)选项,指示它把消息保存到特定文件,或者自定义 /etc/syslog.conf 以满足您的需求。还有一种可能是采取蛮力手段:杀掉 klogd,在一个未使用的虚拟终端上详细打印消息,[1] 或者在一个闲置的 xterm 中执行 cat /proc/kmsg 命令。

If you want to avoid clobbering your system log with the monitoring messages from your driver, you can either specify the -f (file) option to klogd to instruct it to save messages to a specific file, or customize /etc/syslog.conf to suit your needs. Yet another possibility is to take the brute-force approach: kill klogd and verbosely print messages on an unused virtual terminal,[1] or issue the command cat /proc/kmsg from an unused xterm.

打开和关闭消息

Turning the Messages On and Off

在驱动程序开发的早期阶段, printk可以为调试和测试新代码提供很大帮助。另一方面,当您正式发布驱动程序时,您应该删除或至少禁用此类打印语句。不幸的是,您可能会发现,一旦您认为不再需要这些消息并删除它们,您就在驱动程序中实现了一项新功能(或者有人发现了错误),并且您想要至少更改其中一个消息重新开启。有多种方法可以解决这两个问题,全局启用或禁用调试消息以及打开或关闭单个消息。

During the early stages of driver development, printk can help considerably in debugging and testing new code. When you officially release the driver, on the other hand, you should remove, or at least disable, such print statements. Unfortunately, you're likely to find that as soon as you think you no longer need the messages and remove them, you implement a new feature in the driver (or somebody finds a bug), and you want to turn at least one of the messages back on. There are several ways to solve both issues, to globally enable or disable your debug messages and to turn individual messages on or off.

在这里,我们展示了一种对printk调用进行编码的方法,以便您可以单独或全局地打开和关闭它们;该技术取决于定义一个宏,该宏在您需要时解析为printk(或printf )调用:

Here we show one way to code printk calls so you can turn them on and off individually or globally; the technique depends on defining a macro that resolves to a printk (or printf ) call when you want it to:

  • 通过在宏名称中删除或添加单个字母,可以启用或禁用每个打印语句。

  • Each print statement can be enabled or disabled by removing or adding a single letter to the macro's name.

  • 通过在编译前更改 CFLAGS 变量的值,可以一次性禁用所有消息。

  • All the messages can be disabled at once by changing the value of the CFLAGS variable before compiling.

  • 相同的打印语句可以在内核代码和用户级代码中使用,以便驱动程序和测试程序可以以相同的方式管理额外的消息。

  • The same print statement can be used in kernel code and user-level code, so that the driver and test programs can be managed in the same way with regard to extra messages.

以下代码片段实现了这些功能,它直接取自头文件 scull.h:

The following code fragment implements these features and comes directly from the header scull.h:

#undef PDEBUG             /* undef 它,以防万一 */
#ifdef SCULL_DEBUG
#  ifdef __KERNEL__
     /* 调试打开且位于内核空间时用这个 */
#    define PDEBUG(fmt, args...) printk( KERN_DEBUG "scull: " fmt, ## args)
#  else
     /* 这个用于用户空间 */
#    define PDEBUG(fmt, args...) fprintf(stderr, fmt, ## args)
#  endif
#else
#  define PDEBUG(fmt, args...) /* 不调试:什么都没有 */
#endif

#undef PDEBUGG
#define PDEBUGG(fmt, args...) /* 什么也没有:它是一个占位符 */
#undef PDEBUG             /* undef it, just in case */
#ifdef SCULL_DEBUG
#  ifdef __KERNEL__
     /* This one if debugging is on, and kernel space */
#    define PDEBUG(fmt, args...) printk( KERN_DEBUG "scull: " fmt, ## args)
#  else
     /* This one for user space */
#    define PDEBUG(fmt, args...) fprintf(stderr, fmt, ## args)
#  endif
#else
#  define PDEBUG(fmt, args...) /* not debugging: nothing */
#endif

#undef PDEBUGG
#define PDEBUGG(fmt, args...) /* nothing: it's a placeholder */

符号 PDEBUG 是否被定义取决于 SCULL_DEBUG 是否被定义,并且它会以适合代码运行环境的方式显示信息:在内核中时,它使用内核调用 printk;在用户空间运行时,则使用 libc 调用 fprintf 输出到标准错误。另一方面,PDEBUGG 符号什么也不做;它可用于方便地"注释掉"打印语句而无需将其完全删除。

The symbol PDEBUG is defined or undefined, depending on whether SCULL_DEBUG is defined, and displays information in whatever manner is appropriate to the environment where the code is running: it uses the kernel call printk when it's in the kernel and the libc call fprintf to the standard error when run in user space. The PDEBUGG symbol, on the other hand, does nothing; it can be used to easily "comment" print statements without removing them entirely.

要进一步简化该过程,请将以下行添加到您的 makefile 中:

To simplify the process further, add the following lines to your makefile:

# 注释/取消注释下面一行以禁用/启用调试
DEBUG = y

# 将调试标志添加(或不添加)到 CFLAGS
ifeq ($(DEBUG),y)
  DEBFLAGS = -O -g -DSCULL_DEBUG # 需要 "-O" 来展开内联函数
else
  DEBFLAGS = -O2
endif

CFLAGS += $(DEBFLAGS)
CFLAGS += $(DEBFLAGS)
# Comment/uncomment the following line to disable/enable debugging
DEBUG = y

# Add your debugging flag (or not) to CFLAGS
ifeq ($(DEBUG),y)
  DEBFLAGS = -O -g -DSCULL_DEBUG # "-O" is needed to expand inlines
else
  DEBFLAGS = -O2
endif

CFLAGS += $(DEBFLAGS)

本节中展示的宏依赖于 gcc 对 ANSI C 预处理器的扩展,该扩展支持带可变数量参数的宏。这种对 gcc 的依赖应该不成问题,因为内核本身就严重依赖 gcc 的特性。此外,这个 makefile 依赖于 GNU 版本的 make;同样,内核已经依赖 GNU make,所以这种依赖也不是问题。

The macros shown in this section depend on a gcc extension to the ANSI C preprocessor that supports macros with a variable number of arguments. This gcc dependency shouldn't be a problem, because the kernel proper depends heavily on gcc features anyway. In addition, the makefile depends on GNU's version of make ; once again, the kernel already depends on GNU make, so this dependency is not a problem.

如果您熟悉 C 预处理器,就可以扩展这里给出的定义来实现"调试级别"的概念:定义不同的级别,并为每个级别分配一个整数(或位掩码)值,以确定该级别应有多详细。

If you're familiar with the C preprocessor, you can expand on the given definitions to implement the concept of a "debug level," defining different levels and assigning an integer (or bit mask) value to each level to determine how verbose it should be.

但每个驱动程序都有自己的功能和监控需求。良好编程的艺术在于在灵活性和效率之间选择最佳权衡,我们无法告诉您什么是最适合您的。请记住,预处理器条件(以及代码中的常量表达式)是在编译时执行的,因此您必须重新编译才能打开或关闭消息。一种可能的替代方法是使用 C 条件语句,它在运行时执行,因此允许您在程序执行期间打开和关闭消息传递。这是一个很好的功能,但每次执行代码时都需要进行额外的处理,即使消息被禁用,这也会影响性能。有时这种性能损失是不可接受的。

But every driver has its own features and monitoring needs. The art of good programming is in choosing the best trade-off between flexibility and efficiency, and we can't tell what is the best for you. Remember that preprocessor conditionals (as well as constant expressions in the code) are executed at compile time, so you must recompile to turn messages on or off. A possible alternative is to use C conditionals, which are executed at runtime and, therefore, permit you to turn messaging on and off during program execution. This is a nice feature, but it requires additional processing every time the code is executed, which can affect performance even when the messages are disabled. Sometimes this performance hit is unacceptable.

本节中显示的宏已证明在许多情况下都很有用,唯一的缺点是在对其消息进行任何更改后需要重新编译模块。

The macros shown in this section have proven themselves useful in a number of situations, with the only disadvantage being the requirement to recompile a module after any changes to its messages.

速率限制

Rate Limiting

如果不小心,您可能会发现自己用 printk 生成了成千上万条消息,淹没控制台,甚至可能塞满系统日志文件。在使用慢速控制台设备(例如串行端口)时,过高的消息速率还会拖慢系统,甚至使其失去响应。当控制台不停地喷涌数据时,很难弄清系统到底出了什么问题。因此,您应该对打印的内容非常小心,尤其是在驱动程序的生产版本中,尤其是在初始化完成之后。一般来说,生产代码在正常操作期间不应打印任何内容;打印输出应该是需要注意的异常情况的标志。

If you are not careful, you can find yourself generating thousands of messages with printk, overwhelming the console and, possibly, overflowing the system log file. When using a slow console device (e.g., a serial port), an excessive message rate can also slow down the system or just make it unresponsive. It can be very hard to get a handle on what is wrong with a system when the console is spewing out data nonstop. Therefore, you should be very careful about what you print, especially in production versions of drivers and especially once initialization is complete. In general, production code should never print anything during normal operation; printed output should be an indication of an exceptional situation requiring attention.

另一方面,如果您所驱动的设备停止工作,您可能想发出一条日志消息。但应当小心,不要做过头。一个面对失败仍不停重试的愚蠢进程每秒可能产生数千次重试;如果您的驱动程序每次都打印"我的设备坏了"的消息,就可能产生海量输出,而且如果控制台设备很慢,还可能霸占 CPU,因为控制台无法用中断来驱动,即使它是串行端口或行式打印机。

On the other hand, you may want to emit a log message if a device you are driving stops working. But you should be careful not to overdo things. An unintelligent process that continues forever in the face of failures can generate thousands of retries per second; if your driver prints a "my device is broken" message every time, it could create vast amounts of output and possibly hog the CPU if the console device is slow—no interrupts can be used to drive the console, even if it is a serial port or a line printer.

在许多情况下,最好的行为是设置一个标志,表示“我已经对此进行了抱怨”,并且一旦设置了标志,就不再打印任何进一步的消息。但在其他情况下,有理由偶尔发出“设备仍然损坏”的通知。内核提供了一个在这种情况下可以提供帮助的函数:

In many cases, the best behavior is to set a flag saying, "I have already complained about this," and not print any further messages once the flag gets set. In others, though, there are reasons to emit an occasional "the device is still broken" notice. The kernel has provided a function that can be helpful in such cases:
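As a minimal kernel-style sketch of the first approach (the flag, function, and message names here are hypothetical, not from any real driver):

```c
/* Hypothetical driver context; "complained" is our own one-shot flag. */
static int complained;

static void sample_handle_failure(void)
{
    if (!complained) {
        complained = 1;
        printk(KERN_ERR "sampledev: device is broken; suppressing further messages\n");
    }
    /* retry/recovery logic continues without further logging */
}
```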

int printk_ratelimit(void);

在考虑打印可能经常重复的消息之前,应该调用此函数。如果函数返回非零值,则继续打印消息,否则跳过它。因此,典型的调用如下所示:

This function should be called before you consider printing a message that could be repeated often. If the function returns a nonzero value, go ahead and print your message, otherwise skip it. Thus, typical calls look like this:

if (printk_ratelimit(  ))
    printk(KERN_NOTICE "The printer is still on fire\n");

printk_ratelimit 的工作原理是跟踪发送到控制台的消息数量。当输出级别超过阈值时, printk_ratelimit开始返回0并导致消息被丢弃。

printk_ratelimit works by tracking how many messages are sent to the console. When the level of output exceeds a threshold, printk_ratelimit starts returning 0 and causing messages to be dropped.

printk_ratelimit 的行为可以通过修改 /proc/sys/kernel/printk_ratelimit(重新启用消息之前等待的秒数)和 /proc/sys/kernel/printk_ratelimit_burst(速率限制之前接受的消息数)来自定义。

The behavior of printk_ratelimit can be customized by modifying /proc/sys/kernel/printk_ratelimit (the number of seconds to wait before re-enabling messages) and /proc/sys/kernel/printk_ratelimit_burst (the number of messages accepted before rate limiting).

打印设备编号

Printing Device Numbers

有时,当从驱动程序打印消息时,您可能需要打印与感兴趣的硬件关联的设备编号。打印主编号和次编号并不特别困难,但为了保持一致性,内核为此提供了几个实用宏(在 <linux/kdev_t.h> 中定义):

Occasionally, when printing a message from a driver, you will want to print the device number associated with the hardware of interest. It is not particularly hard to print the major and minor numbers, but, in the interest of consistency, the kernel provides a couple of utility macros (defined in <linux/kdev_t.h>) for this purpose:

int print_dev_t(char *buffer, dev_t dev);
char *format_dev_t(char *buffer, dev_t dev);

两个宏都将设备编号编码到给定的 buffer 中;唯一的区别是 print_dev_t 返回打印的字符数,而 format_dev_t 返回 buffer,因此它可以直接用作 printk 调用的参数,尽管必须记住 printk 在提供尾随换行符之前不会刷新。缓冲区应该足够大以容纳设备编号;考虑到 64 位设备编号在未来的内核版本中很有可能出现,缓冲区应该至少有 20 个字节长。

Both macros encode the device number into the given buffer; the only difference is that print_dev_t returns the number of characters printed, while format_dev_t returns buffer; therefore, it can be used as a parameter to a printk call directly, although one must remember that printk doesn't flush until a trailing newline is provided. The buffer should be large enough to hold a device number; given that 64-bit device numbers are a distinct possibility in future kernel releases, the buffer should probably be at least 20 bytes long.
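As a brief sketch (the dev variable and message text here are our own, not from scull), format_dev_t can feed a printk directly:

```c
/* Sketch: "dev" is assumed to hold the driver's dev_t device number. */
char devname[20];   /* large enough for a future 64-bit device number */

printk(KERN_INFO "sampledev: running on device %s\n",
       format_dev_t(devname, dev));
```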

通过查询进行调试

Debugging by Querying

上一节描述了 printk 的工作原理以及如何使用它。它没有谈到的是它的缺点。

The previous section described how printk works and how it can be used. What it didn't talk about are its disadvantages.

大量使用 printk 会明显减慢系统速度,即使您降低 console_loglevel 以避免加载控制台设备也是如此,因为 syslogd 会不断同步其输出文件;因此,打印的每一行都会引起一次磁盘操作。从 syslogd 的角度来看,这是正确的实现。它会尝试将所有内容写入磁盘,以防系统在打印消息后立即崩溃;但是,您不想仅仅为了调试消息而减慢系统速度。这个问题可以通过在 /etc/syslogd.conf 中出现的日志文件名称前加上连字符来解决。[2] 更改配置文件的问题在于,即使在完成调试后,修改可能仍保留在那里,而在正常的系统操作期间,您确实希望消息尽快刷新到磁盘。这种永久更改的替代方法是运行 klogd 之外的程序(例如前面建议的 cat /proc/kmsg),但这可能无法为正常系统操作提供合适的环境。

A massive use of printk can slow down the system noticeably, even if you lower console_loglevel to avoid loading the console device, because syslogd keeps syncing its output files; thus, every line that is printed causes a disk operation. This is the right implementation from syslogd 's perspective. It tries to write everything to disk in case the system crashes right after printing the message; however, you don't want to slow down your system just for the sake of debugging messages. This problem can be solved by prefixing the name of your log file as it appears in /etc/syslogd.conf with a hyphen.[2] The problem with changing the configuration file is that the modification will likely remain there after you are done debugging, even though during normal system operation you do want messages to be flushed to disk as soon as possible. An alternative to such a permanent change is running a program other than klogd (such as cat /proc/kmsg, as suggested earlier), but this may not provide a suitable environment for normal system operation.

通常,获取相关信息的最佳方式是在需要信息时查询系统,而不是不断生成数据。事实上,每个 Unix 系统都提供了许多用于获取系统信息的工具:psnetstatvmstat等。

More often than not, the best way to get relevant information is to query the system when you need the information, instead of continually producing data. In fact, every Unix system provides many tools for obtaining system information: ps, netstat, vmstat, and so on.

驱动程序开发人员可以使用一些技术来查询系统:在/proc文件系统中创建文件、使用 ioctl驱动程序方法以及通过 sysfs导出属性。sysfs的使用需要相当多的驱动程序模型背景知识。第 14 章对此进行了讨论。

A few techniques are available to driver developers for querying the system: creating a file in the /proc filesystem, using the ioctl driver method, and exporting attributes via sysfs. The use of sysfs requires quite some background on the driver model. It is discussed in Chapter 14.

使用 /proc 文件系统

Using the /proc Filesystem

/proc文件系统是一种特殊的、由软件创建的文件系统,内核使用它向外界导出信息。/proc下的每个文件都与一个内核函数相关联,该函数在读取文件时动态生成文件的“内容”。我们已经看到其中一些文件正在运行;例如,/proc/modules始终返回当前加载的模块的列表。

The /proc filesystem is a special, software-created filesystem that is used by the kernel to export information to the world. Each file under /proc is tied to a kernel function that generates the file's "contents" on the fly when the file is read. We have already seen some of these files in action; /proc/modules, for example, always returns a list of the currently loaded modules.

/proc在 Linux 系统中被大量使用。现代 Linux 发行版上的许多实用程序(例如pstopuptime )从/proc获取信息 。某些设备驱动程序还通过/proc导出信息,您的设备驱动程序也可以这样做。/proc文件系统是动态的,因此您的模块可以随时添加或删除条目。

/proc is heavily used in the Linux system. Many utilities on a modern Linux distribution, such as ps, top, and uptime, get their information from /proc. Some device drivers also export information via /proc, and yours can do so as well. The /proc filesystem is dynamic, so your module can add or remove entries at any time.

功能齐全的/proc条目可能非常复杂;除此之外,它们可以被写入和读取。然而,大多数时候,/proc条目是只读文件。本节涉及简单的只读情况。那些有兴趣实现更复杂的东西的人可以在这里查看基础知识;然后可以查阅内核源代码以了解完整情况。

Fully featured /proc entries can be complicated beasts; among other things, they can be written to as well as read from. Most of the time, however, /proc entries are read-only files. This section concerns itself with the simple read-only case. Those who are interested in implementing something more complicated can look here for the basics; the kernel source may then be consulted for the full picture.

然而,在继续之前,我们应该提到不鼓励在/proc下添加文件。/proc文件系统被内核开发人员视为有点不受控制的混乱,远远超出了其最初的目的(即提供有关系统中运行的进程的信息)。在新代码中提供信息的推荐方法是通过 sysfs。然而,正如所建议的,使用 sysfs 需要了解 Linux 设备模型,直到第 14 章我们才会了解这一点。同时,/proc下的文件更容易创建,并且它们完全适合调试目的,因此我们在这里介绍它们。

Before we continue, however, we should mention that adding files under /proc is discouraged. The /proc filesystem is seen by the kernel developers as a bit of an uncontrolled mess that has gone far beyond its original purpose (which was to provide information about the processes running in the system). The recommended way of making information available in new code is via sysfs. As suggested, working with sysfs requires an understanding of the Linux device model, however, and we do not get to that until Chapter 14. Meanwhile, files under /proc are slightly easier to create, and they are entirely suitable for debugging purposes, so we cover them here.

在 /proc 中实现文件

Implementing files in /proc

所有与/proc一起使用的模块 应包含<linux/proc_fs.h>来定义正确的函数。

All modules that work with /proc should include <linux/proc_fs.h> to define the proper functions.

要创建一个只读的 /proc 文件,您的驱动程序必须实现一个在读取文件时生成数据的函数。当某个进程读取该文件(使用 read 系统调用)时,请求会通过此函数到达您的模块。我们将首先查看这个函数,然后在本节稍后介绍注册接口。

To create a read-only /proc file, your driver must implement a function to produce the data when the file is read. When some process reads the file (using the read system call), the request reaches your module by means of this function. We'll look at this function first and get to the registration interface later in this section.

当进程从/proc文件中读取数据时,内核会分配一页内存(即PAGE_SIZE 字节),驱动程序可以在其中写入要返回到用户空间的数据。该缓冲区将传递给您的函数,该函数是一个名为read_proc的方法:

When a process reads from your /proc file, the kernel allocates a page of memory (i.e., PAGE_SIZE bytes) where the driver can write data to be returned to user space. That buffer is passed to your function, which is a method called read_proc:

int (*read_proc)(char *page, char **start, off_t offset, int count, 
                 int *eof, void *data);

指针 page 是您将在其中写入数据的缓冲区;start 由函数用来说明有趣的数据写入了 page 中的何处(稍后会详细介绍);offset 和 count 与 read 方法中的含义相同。eof 参数指向一个必须由驱动程序设置的整数,用以表明它没有更多数据要返回,而 data 是一个特定于驱动程序的数据指针,可用于内部簿记。

The page pointer is the buffer where you'll write your data; start is used by the function to say where the interesting data has been written in page (more on this later); offset and count have the same meaning as for the read method. The eof argument points to an integer that must be set by the driver to signal that it has no more data to return, while data is a driver-specific data pointer you can use for internal bookkeeping.

该函数应该返回实际放入 page 缓冲区中的数据字节数,就像 read 方法对其他文件所做的那样。其他输出值是 *eof 和 *start。eof 是一个简单的标志,但 start 值的使用要复杂一些;其目的是帮助实现大型(大于一页)的 /proc 文件。

This function should return the number of bytes of data actually placed in the page buffer, just like the read method does for other files. Other output values are *eof and *start. eof is a simple flag, but the use of the start value is somewhat more complicated; its purpose is to help with the implementation of large (greater than one page) /proc files.

start 参数有一些非常规的用途。其目的是指示要返回给用户的数据位于 page 中的何处。当您的 proc_read 方法被调用时,*start 将为 NULL。如果您将其保留为 NULL,内核会假设数据已放入 page,就像 offset 为 0 一样;换句话说,它假设这是一个简单版本的 proc_read,它将虚拟文件的全部内容放入 page 而不理会 offset 参数。相反,如果您将 *start 设置为非 NULL 值,内核会假定 *start 指向的数据已考虑了 offset,并且可以直接返回给用户。一般来说,返回少量数据的简单 proc_read 方法会忽略 start。更复杂的方法则将 *start 设置为 page,并且仅从请求的偏移量开始放置数据。

The start parameter has a somewhat unconventional use. Its purpose is to indicate where (within page) the data to be returned to the user is found. When your proc_read method is called, *start will be NULL. If you leave it NULL, the kernel assumes that the data has been put into page as if offset were 0; in other words, it assumes a simple-minded version of proc_read, which places the entire contents of the virtual file in page without paying attention to the offset parameter. If, instead, you set *start to a non-NULL value, the kernel assumes that the data pointed to by *start takes offset into account and is ready to be returned directly to the user. In general, simple proc_read methods that return tiny amounts of data just ignore start. More complex methods set *start to page and only place data beginning at the requested offset there.

/proc 文件长期以来还存在另一个主要问题,start 也是为了解决它。有时,内核数据结构的 ASCII 表示在连续的 read 调用之间会发生变化,因此读取进程可能会发现从一次调用到下一次调用的数据不一致。如果 *start 被设置为一个小整数值,调用者会用它来递增 filp->f_pos,而与您返回的数据量无关,从而使 f_pos 成为您的 read_proc 过程的内部记录号。例如,如果您的 read_proc 函数从一个大的结构数组中返回信息,并且在第一次调用中返回了五个结构,则 *start 可以设置为 5。下一次调用会提供相同的值作为偏移量;然后驱动程序就知道从数组中的第六个结构开始返回数据。其作者承认这是一种"黑客"手段,可以在 fs/proc/generic.c 中看到。

There has long been another major issue with /proc files, which start is meant to solve as well. Sometimes the ASCII representation of kernel data structures changes between successive calls to read, so the reader process could find inconsistent data from one call to the next. If *start is set to a small integer value, the caller uses it to increment filp->f_pos independently of the amount of data you return, thus making f_pos an internal record number of your read_proc procedure. If, for example, your read_proc function is returning information from a big array of structures, and five of those structures were returned in the first call, *start could be set to 5. The next call provides that same value as the offset; the driver then knows to start returning data from the sixth structure in the array. This is acknowledged as a "hack" by its authors and can be seen in fs/proc/generic.c.

请注意,有一种更好的方法来实现大型 /proc 文件;它被称为 seq_file,我们很快就会讨论它。不过,首先是举个例子的时候了。这是 scull 设备的一个简单(虽然有点难看)的 read_proc 实现:

Note that there is a better way to implement large /proc files; it's called seq_file, and we'll discuss it shortly. First, though, it is time for an example. Here is a simple (if somewhat ugly) read_proc implementation for the scull device:

int scull_read_procmem(char *buf, char **start, off_t offset,
                   int count, int *eof, void *data)
{
    int i, j, len = 0;
    int limit = count - 80; /* Don't print more than this */

    for (i = 0; i < scull_nr_devs && len <= limit; i++) {
        struct scull_dev *d = &scull_devices[i];
        struct scull_qset *qs = d->data;
        if (down_interruptible(&d->sem))
            return -ERESTARTSYS;
        len += sprintf(buf+len,"\nDevice %i: qset %i, q %i, sz %li\n",
                i, d->qset, d->quantum, d->size);
        for (; qs && len <= limit; qs = qs->next) { /* scan the list */
            len += sprintf(buf + len, "  item at %p, qset at %p\n",
                    qs, qs->data);
            if (qs->data && !qs->next) /* dump only the last item */
                for (j = 0; j < d->qset; j++) {
                    if (qs->data[j])
                        len += sprintf(buf + len,
                                "    % 4i: %8p\n",
                                j, qs->data[j]);
                }
        }
        up(&scull_devices[i].sem);
    }
    *eof = 1;
    return len;
}

这是一个相当典型的 read_proc 实现。它假设永远不需要生成多于一页的数据,因此忽略 start 和 offset 值。然而,它为了以防万一,小心地不超出其缓冲区。

This is a fairly typical read_proc implementation. It assumes that there will never be a need to generate more than one page of data and so ignores the start and offset values. It is, however, careful not to overrun its buffer, just in case.

较旧的接口

An older interface

如果你通读内核源代码, 您可能会遇到使用旧接口实现/proc文件的代码:

If you read through the kernel source, you may encounter code implementing /proc files with an older interface:

int (*get_info)(char *page, char **start, off_t offset, int count);

所有参数的含义与 read_proc 的相同,但缺少 eof 和 data 参数。该接口仍然受支持,但将来可能会消失;新代码应该改用 read_proc 接口。

All of the arguments have the same meaning as they do for read_proc, but the eof and data arguments are missing. This interface is still supported, but it could go away in the future; new code should use the read_proc interface instead.

创建 /proc 文件

Creating your /proc file

一旦定义了 read_proc 函数,您需要将其连接到 /proc 层次结构中的一个条目。这是通过调用 create_proc_read_entry 完成的:

Once you have a read_proc function defined, you need to connect it to an entry in the /proc hierarchy. This is done with a call to create_proc_read_entry :

struct proc_dir_entry *create_proc_read_entry(const char *name,
                              mode_t mode, struct proc_dir_entry *base, 
                              read_proc_t *read_proc, void *data);

这里,name 是要创建的文件的名称,mode 是文件的保护掩码(传递 0 则使用系统范围的默认值),base 表示应在其中创建文件的目录(如果 base 为 NULL,则文件在 /proc 根目录中创建),read_proc 是实现该文件的 read_proc 函数,而 data 被内核忽略(但会传递给 read_proc)。以下是 scull 使其 /proc 函数可用作 /proc/scullmem 所使用的调用:

Here, name is the name of the file to create, mode is the protection mask for the file (it can be passed as 0 for a system-wide default), base indicates the directory in which the file should be created (if base is NULL, the file is created in the /proc root), read_proc is the read_proc function that implements the file, and data is ignored by the kernel (but passed to read_proc). Here is the call used by scull to make its /proc function available as /proc/scullmem:

create_proc_read_entry("scullmem", 0 /* default mode */,
        NULL /* parent dir */, scull_read_procmem,
        NULL /* client data */);

在这里,我们直接在/proc下创建一个名为scullmem的文件,具有默认的、世界可读的保护。

Here, we create a file called scullmem directly under /proc, with the default, world-readable protections.

目录项指针可用于创建/proc下的整个目录层次结构。但请注意,只要目录本身已经存在,只需将目录名称作为条目名称的一部分即可将条目放置在/proc的子目录中。例如,一个(经常被忽略的)约定是与设备驱动程序关联的/proc条目应该放在子目录driver/中;scull只需将其名称命名为driver/scullmem即可将其条目放置在那里。

The directory entry pointer can be used to create entire directory hierarchies under /proc. Note, however, that an entry may be more easily placed in a subdirectory of /proc simply by giving the directory name as part of the name of the entry—as long as the directory itself already exists. For example, an (often ignored) convention says that /proc entries associated with device drivers should go in the subdirectory driver/; scull could place its entry there simply by giving its name as driver/scullmem.
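Following that convention, scull's registration call shown earlier could place its entry under /proc/driver simply by changing the name (a sketch; it assumes the driver/ directory already exists):

```c
create_proc_read_entry("driver/scullmem", 0 /* default mode */,
        NULL /* parent dir: the name is interpreted relative to /proc */,
        scull_read_procmem, NULL /* client data */);
```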

当然,卸载模块时应该删除 /proc中的条目。remove_proc_entry是撤消create_proc_read_entry已经执行的操作的函数 :

Entries in /proc, of course, should be removed when the module is unloaded. remove_proc_entry is the function that undoes what create_proc_read_entry already did:

remove_proc_entry("scullmem", NULL /* parent dir */);

未能删除条目可能会导致在不需要的时间进行调用,或者,如果您的模块已卸载,则内核崩溃。

Failure to remove entries can result in calls at unwanted times, or, if your module has been unloaded, kernel crashes.
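One way to make sure the removal always happens is to pair the two calls in the module's init and exit functions; this sketch reuses the scull calls shown earlier (the init/exit function names are our own, and error handling for a failed creation is omitted):

```c
static int __init scullmem_proc_init(void)
{
    create_proc_read_entry("scullmem", 0, NULL, scull_read_procmem, NULL);
    return 0;
}

static void __exit scullmem_proc_exit(void)
{
    remove_proc_entry("scullmem", NULL);  /* undone before the module goes away */
}
```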

当如上所示使用 /proc 文件时,您必须记住该实现的一些麻烦之处;毫不奇怪,现在已不鼓励使用它。

When using /proc files as shown, you must remember a few nuisances of the implementation—no surprise its use is discouraged nowadays.

最重要的问题是删除/proc 条目。这种删除很可能在文件使用时发生,因为没有与/proc条目关联的所有者,因此使用它们不会影响模块的引用计数。例如,在删除模块之前运行sleep 100 < /proc/myfile即可触发此问题 。

The most important problem is with removal of /proc entries. Such removal may well happen while the file is in use, as there is no owner associated to /proc entries, so using them doesn't act on the module's reference count. This problem is simply triggered by running sleep 100 < /proc/myfile just before removing the module, for example.

另一个问题是关于注册两个同名条目。内核信任驱动程序并且不会检查该名称是否已注册,因此如果您不小心,您可能会得到两个或多个具有相同名称的条目。这是教室中经常发生的问题,并且这些条目在您访问它们时和调用remove_proc_entry时都无法区分。

Another issue is about registering two entries with the same name. The kernel trusts the driver and doesn't check if the name is already registered, so if you are not careful you might end up with two or more entries with the same name. This is a problem known to happen in classrooms, and such entries are indistinguishable, both when you access them and when you call remove_proc_entry.

seq_file 接口

The seq_file interface

正如我们上面指出的,在 /proc 下实现大文件有点尴尬。随着时间的推移,当输出量变大时,/proc 方法因实现充满缺陷而臭名昭著。作为清理 /proc 代码并使内核程序员的工作更轻松的一种方式,内核添加了 seq_file 接口。该接口提供了一组简单的函数来实现大型内核虚拟文件。

As we noted above, the implementation of large files under /proc is a little awkward. Over time, /proc methods have become notorious for buggy implementations when the amount of output grows large. As a way of cleaning up the /proc code and making life easier for kernel programmers, the seq_file interface was added. This interface provides a simple set of functions for the implementation of large kernel virtual files.

seq_file界面假定您正在创建一个虚拟文件,该文件逐步执行必须返回到用户空间的一系列项目。要使用seq_file,您必须创建一个简单的“迭代器”对象,该对象可以在序列中建立一个位置、向前一步并输出序列中的一项。听起来可能很复杂,但实际上,过程非常简单。我们将逐步在scull驱动程序中创建/proc文件,以展示它是如何完成的。

The seq_file interface assumes that you are creating a virtual file that steps through a sequence of items that must be returned to user space. To use seq_file, you must create a simple "iterator" object that can establish a position within the sequence, step forward, and output one item in the sequence. It may sound complicated, but, in fact, the process is quite simple. We'll step through the creation of a /proc file in the scull driver to show how it is done.

第一步不可避免地是包含<linux/seq_file.h>。然后,您必须创建四个迭代器方法,分别称为startnextstopshow

The first step, inevitably, is the inclusion of <linux/seq_file.h>. Then you must create four iterator methods, called start, next, stop, and show.

start 方法总是首先被调用。该函数的原型是:

The start method is always called first. The prototype for this function is:

void *start(struct seq_file *sfile, loff_t *pos);

sfile 参数几乎总是可以被忽略。pos 是一个整数位置,指示读取应该从哪里开始。位置的解释完全取决于实现;它不必是结果文件中的字节位置。由于 seq_file 实现通常逐步遍历一系列感兴趣的项目,该位置通常被解释为指向序列中下一个项目的光标。scull 驱动程序将每个设备解释为序列中的一项,因此传入的 pos 只是 scull_devices 数组中的一个索引。因此,scull 中使用的 start 方法是:

The sfile argument can almost always be ignored. pos is an integer position indicating where the reading should start. The interpretation of the position is entirely up to the implementation; it need not be a byte position in the resulting file. Since seq_file implementations typically step through a sequence of interesting items, the position is often interpreted as a cursor pointing to the next item in the sequence. The scull driver interprets each device as one item in the sequence, so the incoming pos is simply an index into the scull_devices array. Thus, the start method used in scull is:

static void *scull_seq_start(struct seq_file *s, loff_t *pos)
{
    if (*pos >= scull_nr_devs)
        return NULL;   /* No more to read */
    return scull_devices + *pos;
}

返回值(如果为非NULL)是可由迭代器实现使用的私有值。

The return value, if non-NULL, is a private value that can be used by the iterator implementation.

next 函数应该将迭代器移动到下一个位置,如果序列中没有剩余内容则返回 NULL。该方法的原型是:

The next function should move the iterator to the next position, returning NULL if there is nothing left in the sequence. This method's prototype is:

void *next(struct seq_file *sfile, void *v, loff_t *pos);

这里,v 是上一次调用 start 或 next 返回的迭代器,pos 是文件中的当前位置。next 应该递增 pos 指向的值;根据迭代器的工作方式,您可能(尽管可能不会)希望将 pos 递增超过 1。以下是 scull 的做法:

Here, v is the iterator as returned from the previous call to start or next, and pos is the current position in the file. next should increment the value pointed to by pos; depending on how your iterator works, you might (though probably won't) want to increment pos by more than one. Here's what scull does:

static void *scull_seq_next(struct seq_file *s, void *v, loff_t *pos)
{
    (*pos)++;
    if (*pos >= scull_nr_devs)
        return NULL;
    return scull_devices + *pos;
}

当内核使用完迭代器后,它会调用 stop 进行清理:

When the kernel is done with the iterator, it calls stop to clean up:

void stop(struct seq_file *sfile, void *v);

scull实现没有清理工作要做,因此它的 stop方法是空的。

The scull implementation has no cleanup work to do, so its stop method is empty.
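For completeness, such an empty stop method (as in the scull sources) is simply:

```c
static void scull_seq_stop(struct seq_file *s, void *v)
{
    /* Nothing to clean up for scull */
}
```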

值得注意的是,根据设计,seq_file 代码在调用 start 和 stop 之间不会休眠或执行其他非原子任务。您还可以保证在调用 start 后不久就会看到一次 stop 调用。因此,您的 start 方法获取信号量或自旋锁是安全的。只要您的其他 seq_file 方法是原子的,整个调用序列就是原子的。(如果这一段对您来说没有意义,请在阅读下一章后再回来阅读。)

It is worth noting that the seq_file code, by design, does not sleep or perform other nonatomic tasks between the calls to start and stop. You are also guaranteed to see one stop call sometime shortly after a call to start. Therefore, it is safe for your start method to acquire semaphores or spinlocks. As long as your other seq_file methods are atomic, the whole sequence of calls is atomic. (If this paragraph does not make sense to you, come back to it after you've read the next chapter.)

在这些调用之间,内核调用 show方法实际上向用户空间输出一些有趣的东西。该方法的原型是:

In between these calls, the kernel calls the show method to actually output something interesting to the user space. This method's prototype is:

int show(struct seq_file *sfile, void *v);

此方法应该为迭代器 v 所指示的序列项创建输出。但是,它不应该使用 printk;相反,有一组专门用于 seq_file 输出的函数:

This method should create output for the item in the sequence indicated by the iterator v. It should not use printk, however; instead, there is a special set of functions for seq_file output:

int seq_printf(struct seq_file *sfile, const char *fmt, ...);

这是 seq_file 实现中 printf 的等效物;它采用通常的格式字符串和附加值参数。但是,您还必须将传给 show 函数的 seq_file 结构传递给它。如果 seq_printf 返回非零值,则意味着缓冲区已填满,输出正在被丢弃。然而,大多数实现都会忽略返回值。

This is the printf equivalent for seq_file implementations; it takes the usual format string and additional value arguments. You must also pass it the seq_file structure given to the show function, however. If seq_printf returns a nonzero value, it means that the buffer has filled, and output is being discarded. Most implementations ignore the return value, however.

int seq_putc(struct seq_file *sfile, char c);

int seq_puts(struct seq_file *sfile, const char *s);

这些是用户空间 putc 和 puts 函数的等效项。

These are the equivalents of the user-space putc and puts functions.

int seq_escape(struct seq_file *m, const char *s, const char *esc);

此函数等效于 seq_puts,不同之处在于 s 中任何也出现在 esc 中的字符都会以八进制格式打印。esc 的一个常见值是 " \t\n\\",它可以防止嵌入的空白弄乱输出并可能混淆 shell 脚本。

This function is equivalent to seq_puts with the exception that any character in s that is also found in esc is printed in octal format. A common value for esc is " \t\n\\", which keeps embedded white space from messing up the output and possibly confusing shell scripts.

int seq_path(struct seq_file *sfile, struct vfsmount *m, struct dentry *dentry, char *esc);

此函数可用于输出与给定目录条目关联的文件名。它不太可能在设备驱动程序中有用;为了完整起见,我们将其包含在这里。

This function can be used for outputting the file name associated with a given directory entry. It is unlikely to be useful in device drivers; we have included it here for completeness.

回到我们的例子;scull中使用的 show方法是:

Getting back to our example; the show method used in scull is:

static int scull_seq_show(struct seq_file *s, void *v)
{
    struct scull_dev *dev = (struct scull_dev *) v;
    struct scull_qset *d;
    int i;

    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;
    seq_printf(s, "\nDevice %i: qset %i, q %i, sz %li\n",
            (int) (dev - scull_devices), dev->qset,
            dev->quantum, dev->size);
    for (d = dev->data; d; d = d->next) { /* scan the list */
        seq_printf(s, "  item at %p, qset at %p\n", d, d->data);
        if (d->data && !d->next) /* dump only the last item */
            for (i = 0; i < dev->qset; i++) {
                if (d->data[i])
                    seq_printf(s, "    % 4i: %8p\n",
                            i, d->data[i]);
            }
    }
    up(&dev->sem);
    return 0;
}

在这里,我们最终解释了我们的“迭代器”值,它只是一个指向 scull_dev结构的指针。

Here, we finally interpret our "iterator" value, which is simply a pointer to a scull_dev structure.

现在它拥有了一整套迭代器操作,scull 必须将它们打包起来,并将它们连接到 /proc 中的一个文件。第一步是通过填充一个 seq_operations 结构来完成的:

Now that it has a full set of iterator operations, scull must package them up and connect them to a file in /proc. The first step is done by filling in a seq_operations structure:

static struct seq_operations scull_seq_ops = {
    .start = scull_seq_start,
    .next  = scull_seq_next,
    .stop  = scull_seq_stop,
    .show  = scull_seq_show
};

有了这个结构,我们必须创建一个内核可以理解的文件实现。我们不使用前面描述的 read_proc 方法;使用 seq_file 时,最好在稍低的层次上连接到 /proc。这意味着创建一个 file_operations 结构(是的,与字符驱动程序使用的结构相同),实现内核处理文件读取和查找所需的所有操作。幸运的是,这个任务很简单。第一步是创建一个将文件连接到 seq_file 操作的 open 方法:

With that structure in place, we must create a file implementation that the kernel understands. We do not use the read_proc method described previously; when using seq_file, it is best to connect in to /proc at a slightly lower level. That means creating a file_operations structure (yes, the same structure used for char drivers) implementing all of the operations needed by the kernel to handle reads and seeks on the file. Fortunately, this task is straightforward. The first step is to create an open method that connects the file to the seq_file operations:

static int scull_proc_open(struct inode *inode, struct file *file)
{
    return seq_open(file, &scull_seq_ops);
}

对 seq_open 的调用将 file 结构与我们上面定义的序列操作连接起来。事实证明,open 是我们必须自己实现的唯一文件操作,因此我们现在可以设置我们的 file_operations 结构:

The call to seq_open connects the file structure with our sequence operations defined above. As it turns out, open is the only file operation we must implement ourselves, so we can now set up our file_operations structure:

static struct file_operations scull_proc_ops = {
    .owner   = THIS_MODULE,
    .open    = scull_proc_open,
    .read    = seq_read,
    .llseek  = seq_lseek,
    .release = seq_release
};

在这里,我们指定了自己的open方法,但使用预设方法 seq_readseq_lseekseq_release来完成其他操作。

Here we specify our own open method, but use the canned methods seq_read, seq_lseek, and seq_release for everything else.

最后一步是在/proc中创建实际文件:

The final step is to create the actual file in /proc:

entry = create_proc_entry("scullseq", 0, NULL);
if (entry)
    entry->proc_fops = &scull_proc_ops;

我们不使用create_proc_read_entry,而是调用较低级别的create_proc_entry,它具有以下原型:

Rather than using create_proc_read_entry, we call the lower-level create_proc_entry, which has this prototype:

struct proc_dir_entry *create_proc_entry(const char *name,
                              mode_t mode, 
                              struct proc_dir_entry *parent);

The arguments are the same as their equivalents in create_proc_read_entry: the name of the file, its protections, and the parent directory.

With the above code, scull has a new /proc entry that looks much like the previous one. It is superior, however, because it works regardless of how large its output becomes, it handles seeks properly, and it is generally easier to read and maintain. We recommend the use of seq_file for the implementation of files that contain more than a very small number of lines of output.

The ioctl Method

ioctl, which we show you how to use in Chapter 6, is a system call that acts on a file descriptor; it receives a number that identifies a command to be performed and (optionally) another argument, usually a pointer. As an alternative to using the /proc filesystem, you can implement a few ioctl commands tailored for debugging. These commands can copy relevant data structures from the driver to user space where you can examine them.

Using ioctl this way to get information is somewhat more difficult than using /proc, because you need another program to issue the ioctl and display the results. This program must be written, compiled, and kept in sync with the module you're testing. On the other hand, the driver-side code can be easier than what is needed to implement a /proc file.

There are times when ioctl is the best way to get information, because it runs faster than reading /proc. If some work must be performed on the data before it's written to the screen, retrieving the data in binary form is more efficient than reading a text file. In addition, ioctl doesn't require splitting data into fragments smaller than a page.

Another interesting advantage of the ioctl approach is that information-retrieval commands can be left in the driver even when debugging would otherwise be disabled. Unlike a /proc file, which is visible to anyone who looks in the directory (and too many people are likely to wonder "what that strange file is"), undocumented ioctl commands are likely to remain unnoticed. In addition, they will still be there should something weird happen to the driver. The only drawback is that the module will be slightly bigger.

Debugging by Watching

Sometimes minor problems can be tracked down by watching the behavior of an application in user space. Watching programs can also help in building confidence that a driver is working correctly. For example, we were able to feel confident about scull after looking at how its read implementation reacted to read requests for different amounts of data.

There are various ways to watch a user-space program working. You can run a debugger on it to step through its functions, add print statements, or run the program under strace. Here we'll discuss just the last technique, which is most interesting when the real goal is examining kernel code.

The strace command is a powerful tool that shows all the system calls issued by a user-space program. Not only does it show the calls, but it can also show the arguments to the calls and their return values in symbolic form. When a system call fails, both the symbolic value of the error (e.g., ENOMEM) and the corresponding string (Out of memory) are displayed. strace has many command-line options; the most useful of which are -t to display the time when each call is executed, -T to display the time spent in the call, -e to limit the types of calls traced, and -o to redirect the output to a file. By default, strace prints tracing information on stderr.

strace receives information from the kernel itself. This means that a program can be traced regardless of whether or not it was compiled with debugging support (the -g option to gcc) and whether or not it is stripped. You can also attach tracing to a running process, similar to the way a debugger can connect to a running process and control it.

The trace information is often used to support bug reports sent to application developers, but it's also invaluable to kernel programmers. We've seen how driver code executes by reacting to system calls; strace allows us to check the consistency of input and output data of each call.

For example, the following screen dump shows (most of) the last lines of running the command strace ls /dev > /dev/scull0 :

open("/dev", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
fstat64(3, {st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
getdents64(3, /* 141 entries */, 4096)  = 4088
[...]
getdents64(3, /* 0 entries */, 4096)    = 0
close(3)                                = 0
[...]
fstat64(1, {st_mode=S_IFCHR|0664, st_rdev=makedev(254, 0), ...}) = 0
write(1, "MAKEDEV\nadmmidi0\nadmmidi1\nadmmid"..., 4096) = 4000
write(1, "b\nptywc\nptywd\nptywe\nptywf\nptyx0\n"..., 96) = 96
write(1, "b\nptyxc\nptyxd\nptyxe\nptyxf\nptyy0\n"..., 4096) = 3904
write(1, "s17\nvcs18\nvcs19\nvcs2\nvcs20\nvcs21"..., 192) = 192
write(1, "\nvcs47\nvcs48\nvcs49\nvcs5\nvcs50\nvc"..., 673) = 673
close(1)                                = 0
exit_group(0)                           = ?

It's apparent from the first write call that after ls finished looking in the target directory, it tried to write 4 KB. Strangely (for ls), only 4000 bytes were written, and the operation was retried. However, we know that the write implementation in scull writes a single quantum at a time, so we could have expected the partial write. After a few steps, everything sweeps through, and the program exits successfully.

As another example, let's read the scull device (using the wc command):

[...]
open("/dev/scull0", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFCHR|0664, st_rdev=makedev(254, 0), ...}) = 0
read(3, "MAKEDEV\nadmmidi0\nadmmidi1\nadmmid"..., 16384) = 4000
read(3, "b\nptywc\nptywd\nptywe\nptywf\nptyx0\n"..., 16384) = 4000
read(3, "s17\nvcs18\nvcs19\nvcs2\nvcs20\nvcs21"..., 16384) = 865
read(3, "", 16384)                      = 0
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
write(1, "8865 /dev/scull0\n", 17)      = 17
close(3)                                = 0
exit_group(0)                           = ?

As expected, read is able to retrieve only 4000 bytes at a time, but the total amount of data is the same that was written in the previous example. It's interesting to note how retries are organized in this example, as opposed to the previous trace. wc is optimized for fast reading and, therefore, bypasses the standard library, trying to read more data with a single system call. You can see from the read lines in the trace how wc tried to read 16 KB at a time.

Linux experts can find much useful information in the output of strace. If you're put off by all the symbols, you can limit yourself to watching how the file methods (open, read, and so on) work by tracing only file-related calls with the -e trace=file option.

Personally, we find strace most useful for pinpointing runtime errors from system calls. Often the perror call in the application or demo program isn't verbose enough to be useful for debugging, and being able to tell exactly which arguments to which system call triggered the error can be a great help.

Debugging System Faults

Even if you've used all the monitoring and debugging techniques, sometimes bugs remain in the driver, and the system faults when the driver is executed. When this happens, it's important to be able to collect as much information as possible to solve the problem.

Note that "fault" doesn't mean "panic." The Linux code is robust enough to respond gracefully to most errors: a fault usually results in the destruction of the current process while the system goes on working. The system can panic, and it may if a fault happens outside of a process's context or if some vital part of the system is compromised. But when the problem is due to a driver error, it usually results only in the sudden death of the process unlucky enough to be using the driver. The only unrecoverable damage when a process is destroyed is that some memory allocated to the process's context is lost; for instance, dynamic lists allocated by the driver through kmalloc might be lost. However, since the kernel calls the close operation for any open device when a process dies, your driver can release what was allocated by the open method.

Even though an oops usually does not bring down the entire system, you may well find yourself needing to reboot after one happens. A buggy driver can leave hardware in an unusable state, leave kernel resources in an inconsistent state, or, in the worst case, corrupt kernel memory in random places. Often you can simply unload your buggy driver and try again after an oops. If, however, you see anything that suggests that the system as a whole is not well, your best bet is usually to reboot immediately.

We've already said that when kernel code misbehaves, an informative message is printed on the console. The next section explains how to decode and use such messages. Even though they appear rather obscure to the novice, processor dumps are full of interesting information, often sufficient to pinpoint a program bug without the need for additional testing.

Oops Messages

Most bugs show themselves in NULL pointer dereferences or by the use of other incorrect pointer values. The usual outcome of such bugs is an oops message.

Almost any address used by the processor is a virtual address and is mapped to physical addresses through a complex structure of page tables (the exceptions are physical addresses used with the memory management subsystem itself). When an invalid pointer is dereferenced, the paging mechanism fails to map the pointer to a physical address, and the processor signals a page fault to the operating system. If the address is not valid, the kernel is not able to "page in" the missing address; it (usually) generates an oops if this happens while the processor is in supervisor mode.

An oops displays the processor status at the time of the fault, including the contents of the CPU registers and other seemingly incomprehensible information. The message is generated by printk statements in the fault handler (arch/*/kernel/traps.c) and is dispatched as described earlier in Section 4.2.1.

Let's look at one such message. Here's what results from dereferencing a NULL pointer on a PC running Version 2.6 of the kernel. The most relevant information here is the instruction pointer (EIP), the address of the faulty instruction.

Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
d083a064
Oops: 0002 [#1]
SMP 
CPU:    0
EIP:    0060:[<d083a064>]    Not tainted
EFLAGS: 00010246   (2.6.6) 
EIP is at faulty_write+0x4/0x10 [faulty]
eax: 00000000   ebx: 00000000   ecx: 00000000   edx: 00000000
esi: cf8b2460   edi: cf8b2480   ebp: 00000005   esp: c31c5f74
ds: 007b   es: 007b   ss: 0068
Process bash (pid: 2086, threadinfo=c31c4000 task=cfa0a6c0)
Stack: c0150558 cf8b2460 080e9408 00000005 cf8b2480 00000000 cf8b2460 cf8b2460 
       fffffff7 080e9408 c31c4000 c0150682 cf8b2460 080e9408 00000005 cf8b2480 
       00000000 00000001 00000005 c0103f8f 00000001 080e9408 00000005 00000005 
Call Trace:
 [<c0150558>] vfs_write+0xb8/0x130
 [<c0150682>] sys_write+0x42/0x70
 [<c0103f8f>] syscall_call+0x7/0xb

Code: 89 15 00 00 00 00 c3 90 8d 74 26 00 83 ec 0c b8 00 a6 83 d0

This message was generated by writing to a device owned by the faulty module, a module built deliberately to demonstrate failures. The implementation of the write method of faulty.c is trivial:

ssize_t faulty_write (struct file *filp, const char __user *buf, size_t count,
        loff_t *pos)
{
    /* make a simple fault by dereferencing a NULL pointer */
    *(int *)0 = 0;
    return 0;
}

As you can see, what we do here is dereference a NULL pointer. Since 0 is never a valid pointer value, a fault occurs, which the kernel turns into the oops message shown earlier. The calling process is then killed.

The faulty module has a different fault condition in its read implementation:

ssize_t faulty_read(struct file *filp, char __user *buf,
            size_t count, loff_t *pos)
{
    int ret;
    char stack_buf[4];

    /* Let's try a buffer overflow  */
    memset(stack_buf, 0xff, 20);
    if (count > 4)
        count = 4; /* copy 4 bytes to the user */
    ret = copy_to_user(buf, stack_buf, count);
    if (!ret)
        return count;
    return ret;
}

This method copies a string into a local variable; unfortunately, the string is longer than the destination array. The resulting buffer overflow causes an oops when the function returns. Since the return instruction brings the instruction pointer to nowhere land, this kind of fault is much harder to trace, and you can get something such as the following:

EIP:    0010:[<00000000>]
Unable to handle kernel paging request at virtual address ffffffff
 printing eip:
ffffffff
Oops: 0000 [#5]
SMP 
CPU:    0
EIP:    0060:[<ffffffff>]    Not tainted
EFLAGS: 00010296   (2.6.6) 
EIP is at 0xffffffff
eax: 0000000c   ebx: ffffffff   ecx: 00000000   edx: bfffda7c
esi: cf434f00   edi: ffffffff   ebp: 00002000   esp: c27fff78
ds: 007b   es: 007b   ss: 0068
Process head (pid: 2331, threadinfo=c27fe000 task=c3226150)
Stack: ffffffff bfffda70 00002000 cf434f20 00000001 00000286 cf434f00 fffffff7 
       bfffda70 c27fe000 c0150612 cf434f00 bfffda70 00002000 cf434f20 00000000 
       00000003 00002000 c0103f8f 00000003 bfffda70 00002000 00002000 bfffda70 
Call Trace:
 [<c0150612>] sys_read+0x42/0x70
 [<c0103f8f>] syscall_call+0x7/0xb

Code:  Bad EIP value.

In this case, we see only part of the call stack (vfs_read and faulty_read are missing), and the kernel complains about a "bad EIP value." That complaint, and the offending address (ffffffff) listed at the beginning are both hints that the kernel stack has been corrupted.

In general, when you are confronted with an oops, the first thing to do is to look at the location where the problem happened, which is usually listed separately from the call stack. In the first oops shown above, the relevant line is:

EIP is at faulty_write+0x4/0x10 [faulty]

Here we see that we were in the function faulty_write , which is located in the faulty module (which is listed in square brackets). The hex numbers indicate that the instruction pointer was 4 bytes into the function, which appears to be 10 (hex) bytes long. Often that is enough to figure out what the problem is.

If you need more information, the call stack shows you how you got to where things fell apart. The stack itself is printed in hex form; with a bit of work, you can often determine the values of local variables and function parameters from the stack listing. Experienced kernel developers can benefit from a certain amount of pattern recognition here; for example, if we look at the stack listing from the faulty_read oops:

Stack: ffffffff bfffda70 00002000 cf434f20 00000001 00000286 cf434f00 fffffff7 
       bfffda70 c27fe000 c0150612 cf434f00 bfffda70 00002000 cf434f20 00000000 
       00000003 00002000 c0103f8f 00000003 bfffda70 00002000 00002000 bfffda70

The ffffffff at the top of the stack is part of our string that broke things. On the x86 architecture, by default, the user-space stack starts just below 0xc0000000; thus, the recurring value 0xbfffda70 is probably a user-space stack address; it is, in fact, the address of the buffer passed to the read system call, replicated each time it is passed down the kernel call chain. On the x86 (again, by default), kernel space starts at 0xc0000000, so values above that are almost certainly kernel-space addresses, and so on.

Finally, when looking at oops listings, always be on the lookout for the "slab poisoning" values discussed at the beginning of this chapter. Thus, for example, if you get a kernel oops where the offending address is 0xa5a5a5a5, you are almost certainly forgetting to initialize dynamic memory somewhere.

Please note that you see a symbolic call stack (as shown above) only if your kernel is built with the CONFIG_KALLSYMS option turned on. Otherwise, you see a bare, hexadecimal listing, which is far less useful until you have decoded it in other ways.

System Hangs

Although most bugs in kernel code end up as oops messages, sometimes they can completely hang the system. If the system hangs, no message is printed. For example, if the code enters an endless loop, the kernel stops scheduling,[3] and the system doesn't respond to any action, including the magic Ctrl-Alt-Del combination. You have two choices for dealing with system hangs—either prevent them beforehand or be able to debug them after the fact.

You can prevent an endless loop by inserting schedule invocations at strategic points. The schedule call (as you might guess) invokes the scheduler and, therefore, allows other processes to steal CPU time from the current process. If a process is looping in kernel space due to a bug in your driver, the schedule calls enable you to kill the process after tracing what is happening.

You should be aware, of course, that any call to schedule may create an additional source of reentrant calls to your driver, since it allows other processes to run. This reentrancy should not normally be a problem, assuming that you have used suitable locking in your driver. Be sure, however, not to call schedule any time that your driver is holding a spinlock.

If your driver really hangs the system, and you don't know where to insert schedule calls, the best way to go may be to add some print messages and write them to the console (by changing the console_loglevel value if need be).

Sometimes the system may appear to be hung, but it isn't. This can happen, for example, if the keyboard remains locked in some strange way. These false hangs can be detected by looking at the output of a program you keep running for just this purpose. A clock or system load meter on your display is a good status monitor; as long as it continues to update, the scheduler is working.

An indispensable tool for many lockups is the "magic SysRq key," which is available on most architectures. Magic SysRq is invoked with the combination of the Alt and SysRq keys on the PC keyboard, or with other special keys on other platforms (see Documentation/sysrq.txt for details), and is available on the serial console as well. A third key, pressed along with these two, performs one of a number of useful actions:

r

Turns off keyboard raw mode; useful in situations where a crashed application (such as the X server) may have left your keyboard in a strange state.

k

Invokes the " secure attention key" (SAK) function. SAK kills all processes running on the current console, leaving you with a clean terminal.

s

Performs an emergency synchronization of all disks.

u

Umount. Attempts to remount all disks in a read-only mode. This operation, usually invoked immediately after s, can save a lot of filesystem checking time in cases where the system is in serious trouble.

b

Boot. Immediately reboots the system. Be sure to synchronize and remount the disks first.

p

Prints processor registers information.

t

Prints the current task list.

m

Prints memory information.

Other magic SysRq functions exist; see sysrq.txt in the Documentation directory of the kernel source for the full list. Note that magic SysRq must be explicitly enabled in the kernel configuration and that most distributions do not enable it, for obvious security reasons. For a system used to develop drivers, however, enabling magic SysRq is worth the trouble of building a new kernel in itself. Magic SysRq may be disabled at runtime with a command such as the following:

echo 0 > /proc/sys/kernel/sysrq

You should consider disabling it if unprivileged users can reach your system keyboard, to prevent accidental or deliberate damage. Some previous kernel versions had sysrq disabled by default, so you needed to enable it at runtime by writing 1 to that same /proc/sys file.

The sysrq operations are exceedingly useful, so they have been made available to system administrators who can't reach the console. The file /proc/sysrq-trigger is a write-only entry point, where you can trigger a specific sysrq action by writing the associated command character; you can then collect any output data from the kernel logs. This entry point to sysrq is always working, even if sysrq is disabled on the console.

If you are experiencing a "live hang," in which your driver is stuck in a loop but the system as a whole is still functioning, there are a couple of techniques worth knowing. Often, the SysRq p function points the finger directly at the guilty routine. Failing that, you can also use the kernel profiling function. Build a kernel with profiling enabled, and boot it with profile=2 on the command line. Reset the profile counters with the readprofile utility, then send your driver into its loop. After a little while, use readprofile again to see where the kernel is spending its time. Another, more advanced alternative is oprofile, which you may consider as well. The file Documentation/basic_profiling.txt tells you everything you need to know to get started with the profilers.

One precaution worth using when chasing system hangs is to mount all your disks as read-only (or unmount them). If the disks are read-only or unmounted, there's no risk of damaging the filesystem or leaving it in an inconsistent state. Another possibility is using a computer that mounts all of its filesystems via NFS, the network file system. The "NFS-Root" capability must be enabled in the kernel, and special parameters must be passed at boot time. In this case, you'll avoid filesystem corruption without even resorting to SysRq, because filesystem coherence is managed by the NFS server, which is not brought down by your device driver.

Debuggers and Related Tools

The last resort in debugging modules is using a debugger to step through the code, watching the value of variables and machine registers. This approach is time-consuming and should be avoided whenever possible. Nonetheless, the fine-grained perspective on the code that is achieved through a debugger is sometimes invaluable.

Using an interactive debugger on the kernel is a challenge. The kernel runs in its own address space on behalf of all the processes on the system. As a result, a number of common capabilities provided by user-space debuggers, such as breakpoints and single-stepping, are harder to come by in the kernel. In this section we look at several ways of debugging the kernel; each of them has advantages and disadvantages.

使用gdb

Using gdb

gdb 对于查看系统内部结构非常有用。要在这个级别熟练使用调试器,需要对 gdb 命令有一定的把握,对目标平台的汇编代码有一定的了解,以及匹配源代码与优化后汇编代码的能力。

gdb can be quite useful for looking at the system internals. Proficient use of the debugger at this level requires some confidence with gdb commands, some understanding of assembly code for the target platform, and the ability to match source code and optimized assembly.

必须像调试应用程序一样调用调试器来调试内核。除了指定 ELF 内核映像的文件名之外,您还需要在命令行上提供核心文件的名称。对于正在运行的内核,该核心文件就是内核核心映像 /proc/kcore。gdb 的典型调用如下所示:

The debugger must be invoked as though the kernel were an application. In addition to specifying the filename for the ELF kernel image, you need to provide the name of a core file on the command line. For a running kernel, that core file is the kernel core image, /proc/kcore. A typical invocation of gdb looks like the following:

gdb /usr/src/linux/vmlinux /proc/kcore
gdb /usr/src/linux/vmlinux /proc/kcore

第一个参数是未压缩的 ELF 内核可执行文件的名称,而不是 zImage、bzImage 或任何专门为引导环境构建的映像。

The first argument is the name of the uncompressed ELF kernel executable, not the zImage or bzImage or anything built specifically for the boot environment.

gdb 命令行上的第二个参数是核心文件的名称。与 /proc 中的任何文件一样,/proc/kcore 是在读取时生成的。当 read 系统调用在 /proc 文件系统中执行时,它映射到一个数据生成函数而不是数据检索函数;我们已经在第 4.3.1 节中利用了此功能。kcore 用于以核心文件的格式表示内核“可执行文件”;它是一个巨大的文件,因为它代表了整个内核地址空间,对应于所有物理内存。在 gdb 内部,您可以通过发出标准 gdb 命令来查看内核变量。例如,p jiffies 打印从系统启动到当前时间的时钟滴答数。

The second argument on the gdb command line is the name of the core file. Like any file in /proc, /proc/kcore is generated when it is read. When the read system call executes in the /proc filesystem, it maps to a data-generation function rather than a data-retrieval one; we've already exploited this feature in Section 4.3.1. kcore is used to represent the kernel "executable" in the format of a core file; it is a huge file, because it represents the whole kernel address space, which corresponds to all physical memory. From within gdb, you can look at kernel variables by issuing the standard gdb commands. For example, p jiffies prints the number of clock ticks from system boot to the current time.

当您从 gdb 打印数据时,内核仍在运行,各个数据项在不同时间有不同的值;然而,gdb 通过缓存已读取的数据来优化对核心文件的访问。如果您尝试再次查看 jiffies 变量,您将得到与之前相同的答案。缓存值以避免额外的磁盘访问对于传统核心文件来说是正确的行为,但在使用“动态”核心映像时就很不方便了。解决方案是:每当您想要刷新 gdb 的缓存时,就发出 core-file /proc/kcore 命令;调试器会准备使用新的核心文件并丢弃所有旧信息。然而,在读取新数据时,您并不总是需要发出 core-file 命令;gdb 以几千字节的块为单位读取核心文件,并且只缓存它已经引用过的块。

When you print data from gdb, the kernel is still running, and the various data items have different values at different times; gdb, however, optimizes access to the core file by caching data that has already been read. If you try to look at the jiffies variable once again, you'll get the same answer as before. Caching values to avoid extra disk access is a correct behavior for conventional core files but is inconvenient when a "dynamic" core image is used. The solution is to issue the command core-file /proc/kcore whenever you want to flush the gdb cache; the debugger gets ready to use a new core file and discards any old information. You won't, however, always need to issue core-file when reading a new datum; gdb reads the core in chunks of a few kilobytes and caches only chunks it has already referenced.

当您在内核上工作时,gdb 通常提供的许多功能都不可用。例如,gdb 无法修改内核数据;它期望在操作内存映像之前,有一个在其自己控制之下运行的被调试程序。也不可能设置断点或观察点,或单步执行内核函数。

Numerous capabilities normally provided by gdb are not available when you are working with the kernel. For example, gdb is not able to modify kernel data; it expects to be running a program to be debugged under its own control before playing with its memory image. It is also not possible to set breakpoints or watchpoints, or to single-step through kernel functions.

请注意,为了让 gdb 可以使用符号信息,您必须在编译内核时设置 CONFIG_DEBUG_INFO 选项。其结果是磁盘上的内核映像会大得多,但是,如果没有这些信息,挖掘内核变量几乎是不可能的。

Note that, in order to have symbol information available for gdb, you must compile your kernel with the CONFIG_DEBUG_INFO option set. The result is a far larger kernel image on disk, but, without that information, digging through kernel variables is almost impossible.
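As a reminder of what to look for, the relevant lines in a 2.6-era .config look like the following fragment (in the configuration menus, CONFIG_DEBUG_INFO sits under "Kernel hacking" and depends on CONFIG_DEBUG_KERNEL):

```shell
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_INFO=y
```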

有了可用的调试信息,您就可以了解很多有关内核内部发生的情况。gdb 可以愉快地打印出结构、跟踪指针等。然而,更困难的一件事是检查模块。由于模块不是传递给 gdb 的 vmlinux 映像的一部分,因此调试器对它们一无所知。幸运的是,从内核 2.6.7 开始,可以告诉 gdb 检查可加载模块所需了解的信息。

With the debugging information available, you can learn a lot about what is going on inside the kernel. gdb happily prints out structures, follows pointers, etc. One thing that is harder, however, is examining modules. Since modules are not part of the vmlinux image passed to gdb, the debugger knows nothing about them. Fortunately, as of kernel 2.6.7, it is possible to teach gdb what it needs to know to examine loadable modules.

Linux可加载模块是ELF格式的可执行映像;因此,它们被分为许多部分。一个典型的模块可以包含十几个或更多部分,但通常有三个与调试会话相关:

Linux loadable modules are ELF-format executable images; as such, they have been divided up into numerous sections. A typical module can contain a dozen or more sections, but there are typically three that are relevant in a debugging session:

.text
.text

本节包含模块的可执行代码。调试器必须知道该部分的位置,才能进行回溯或设置断点。(在 /proc/kcore 上运行调试器时,这两种操作都无关紧要,但在使用下面介绍的 kgdb 时,它们会很有用。)

This section contains the executable code for the module. The debugger must know where this section is to be able to give tracebacks or set breakpoints. (Neither of these operations is relevant when running the debugger on /proc/kcore, but they can be useful when working with kgdb, described below).

.bss
.data

.bss
.data

这两个部分保存模块的变量。任何在编译时未初始化的变量最终都位于 .bss 中,而已初始化的变量则进入 .data。

These two sections hold the module's variables. Any variable that is not initialized at compile time ends up in .bss, while those that are initialized go into .data.

要让 gdb 与可加载模块配合工作,需要告知调试器给定模块的各个部分被加载到了何处。该信息可在 sysfs 的 /sys/module 下找到。例如,加载 scull 模块后,目录 /sys/module/scull/sections 中包含名称诸如 .text 之类的文件;每个文件的内容是该部分的基地址。

Making gdb work with loadable modules requires informing the debugger about where a given module's sections have been loaded. That information is available in sysfs, under /sys/module. For example, after loading the scull module, the directory /sys/module/scull/sections contains files with names such as .text; the content of each file is the base address for that section.

我们现在可以发出一个 gdb 命令来告诉它我们的模块的情况。我们需要的命令是 add-symbol-file;该命令的参数包括模块目标文件的名称、.text 基地址,以及一系列描述其他感兴趣部分放置位置的可选参数。在深入挖掘 sysfs 中的模块部分数据后,我们可以构造如下命令:

We are now in a position to issue a gdb command telling it about our module. The command we need is add-symbol-file; this command takes as parameters the name of the module object file, the .text base address, and a series of optional parameters describing where any other sections of interest have been put. After digging through the module section data in sysfs, we can construct a command such as:

(gdb) add-symbol-file .../scull.ko 0xd0832000 \
               -s .bss 0xd0837100 \
                       -s .data 0xd0836be0
(gdb) add-symbol-file .../scull.ko 0xd0832000 \
               -s .bss 0xd0837100 \
                       -s .data 0xd0836be0

我们在示例源代码 ( gdbline ) 中包含了一个小脚本,可以为给定模块创建此命令。

We have included a small script in the sample source (gdbline) that can create this command for a given module.
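A helper in the same spirit as the book's gdbline script can be sketched as a small shell function (this version is illustrative, not the shipped script, and emits a single-line variant of the command): it reads the section base addresses from a sysfs-style directory and prints the corresponding add-symbol-file command.

```shell
# Illustrative gdbline-style helper: build an add-symbol-file command
# for a loaded module from its sysfs section files.
gdbline() {
    objfile=$1                  # path to the module's .ko file
    secdir=$2                   # e.g. /sys/module/scull/sections
    printf 'add-symbol-file %s %s' "$objfile" "$(cat "$secdir/.text")"
    for s in .bss .data; do     # add any other sections of interest here
        [ -f "$secdir/$s" ] && printf ' -s %s %s' "$s" "$(cat "$secdir/$s")"
    done
    printf '\n'
}
```

Typical use would be `gdbline scull.ko /sys/module/scull/sections`, and the printed line can be pasted directly into gdb.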

我们现在可以使用gdb来检查可加载模块中的变量。下面是一个取自scull调试会话的简单示例:

We can now use gdb to examine variables in our loadable module. Here is a quick example taken from a scull debugging session:

(gdb) add-symbol-file scull.ko 0xd0832000 \
               -s .bss 0xd0837100 \
                     -s .data 0xd0836be0
add symbol table from file "scull.ko" at
        .text_addr = 0xd0832000
        .bss_addr = 0xd0837100
        .data_addr = 0xd0836be0
(y or n) y
Reading symbols from scull.ko...done.
(gdb) p scull_devices[0]
$1 = {data = 0xcfd66c50, 
      quantum = 4000, 
      qset = 1000, 
      size = 20881,
      access_key = 0, 
      ...}
(gdb) add-symbol-file scull.ko 0xd0832000 \
               -s .bss 0xd0837100 \
                     -s .data 0xd0836be0
add symbol table from file "scull.ko" at
        .text_addr = 0xd0832000
        .bss_addr = 0xd0837100
        .data_addr = 0xd0836be0
(y or n) y
Reading symbols from scull.ko...done.
(gdb) p scull_devices[0]
$1 = {data = 0xcfd66c50, 
      quantum = 4000, 
      qset = 1000, 
      size = 20881,
      access_key = 0, 
      ...}

在这里我们看到第一个 scull 设备当前保存着 20,881 字节。如果我们愿意,我们可以沿着 data 链继续跟踪,或者查看该模块中任何其他感兴趣的内容。

Here we see that the first scull device currently holds 20,881 bytes. If we wanted, we could follow the data chain, or look at anything else of interest in the module.

另一个值得了解的有用技巧是:

One other useful trick worth knowing about is this:

(gdb) print *(address)
(gdb) print *(address)

在这里,用十六进制地址替换 address;输出是与该地址对应的代码所在的文件和行号。例如,该技术对于找出函数指针真正指向的位置可能很有用。

Here, fill in a hex address for address; the output is a file and line number for the code corresponding to that address. This technique may be useful, for example, to find out where a function pointer really points.

我们仍然无法执行典型的调试任务,例如设置断点或修改数据;要执行这些操作,我们需要使用kdb (接下来将介绍)或kgdb(我们很快就会介绍)之类的工具。

We still cannot perform typical debugging tasks like setting breakpoints or modifying data; to perform those operations, we need to use a tool like kdb (described next) or kgdb (which we get to shortly).

kdb 内核调试器

The kdb Kernel Debugger

许多读者可能想知道为什么内核没有内置任何更高级的调试功能。答案很简单,Linus 不相信交互式调试器。他担心它们会导致糟糕的修复,即修补症状而不是解决问题的真正原因。因此,没有内置调试器。

Many readers may be wondering why the kernel does not have any more advanced debugging features built into it. The answer, quite simply, is that Linus does not believe in interactive debuggers. He fears that they lead to poor fixes, those which patch up symptoms rather than addressing the real cause of problems. Thus, no built-in debuggers.

然而,其他内核开发人员偶尔会看到交互式调试工具的使用。其中一种工具是kdb内置内核调试器,可以从oss.sgi.com作为非官方补丁获得。要使用 kdb,您必须获取补丁(确保获取与您的内核版本匹配的版本),应用它,然后重建并重新安装内核。请注意,在撰写本文时,kdb仅适用于 IA-32 (x86) 系统(尽管 IA-64 的版本在被删除之前在主线内核源代码中存在了一段时间)。

Other kernel developers, however, see an occasional use for interactive debugging tools. One such tool is the kdb built-in kernel debugger, available as a nonofficial patch from oss.sgi.com. To use kdb, you must obtain the patch (be sure to get a version that matches your kernel version), apply it, and rebuild and reinstall the kernel. Note that, as of this writing, kdb works only on IA-32 (x86) systems (though a version for the IA-64 existed for a while in the mainline kernel source before being removed).

一旦运行启用了kdb 的内核,有多种方法可以进入调试器。按控制台上的 Pause(或 Break)键启动调试器。当发生内核 oops 或命中断点时,kdb也会启动。无论如何,您都会看到一条类似于以下内容的消息:

Once you are running a kdb-enabled kernel, there are a couple of ways to enter the debugger. Pressing the Pause (or Break) key on the console starts up the debugger. kdb also starts up when a kernel oops happens or when a breakpoint is hit. In any case, you see a message that looks something like this:

Entering kdb (0xc0347b80) on processor 0 due to Keyboard Entry
[0]kdb>
Entering kdb (0xc0347b80) on processor 0 due to Keyboard Entry
[0]kdb>

请注意,当kdb运行时,内核所做的几乎所有事情都会停止。在调用 kdb 的系统上不应运行其他任何东西;特别是,您不应该打开网络 - 当然,除非您正在调试网络驱动程序。如果您要使用kdb,通常最好以单用户模式启动系统。

Note that just about everything the kernel does stops when kdb is running. Nothing else should be running on a system where you invoke kdb; in particular, you should not have networking turned on—unless, of course, you are debugging a network driver. It is generally a good idea to boot the system in single-user mode if you will be using kdb.

作为示例,请考虑快速scull调试会话。假设驱动程序已经加载,我们可以告诉kdb在scull_read中设置断点,如下所示:

As an example, consider a quick scull debugging session. Assuming that the driver is already loaded, we can tell kdb to set a breakpoint in scull_read as follows:

[0]kdb> bp scull_read
Instruction(i) BP #0 at 0xd087c5dc (scull_read)
    is enabled globally adjust 1
[0]kdb> go
[0]kdb> bp scull_read
Instruction(i) BP #0 at 0xd087c5dc (scull_read)
    is enabled globally adjust 1
[0]kdb> go

bp 命令告诉 kdb 在内核下一次进入 scull_read 时停止。然后您键入 go 以继续执行。向其中一个 scull 设备写入一些内容后,我们可以在另一个终端的 shell 下运行 cat 来尝试读取它,产生以下结果:

The bp command tells kdb to stop the next time the kernel enters scull_read. You then type go to continue execution. After putting something into one of the scull devices, we can attempt to read it by running cat under a shell on another terminal, yielding the following:

Instruction(i) breakpoint #0 at 0xd087c5dc (adjusted)
0xd087c5dc scull_read:          int3

Entering kdb (current=0xcf09f890, pid 1575) on processor 0 due to
Breakpoint @ 0xd087c5dc
[0]kdb>
Instruction(i) breakpoint #0 at 0xd087c5dc (adjusted)
0xd087c5dc scull_read:          int3

Entering kdb (current=0xcf09f890, pid 1575) on processor 0 due to
Breakpoint @ 0xd087c5dc
[0]kdb>

我们现在位于scull_read的开头。要了解我们是如何到达那里的,我们可以获得堆栈跟踪:

We are now positioned at the beginning of scull_read. To see how we got there, we can get a stack trace:

[0]kdb> bt
    ESP    EIP        Function (args)
0xcdbddf74 0xd087c5dc [scull]scull_read
0xcdbddf78 0xc0150718 vfs_read+0xb8
0xcdbddfa4 0xc01509c2 sys_read+0x42
0xcdbddfc4 0xc0103fcf syscall_call+0x7
[0]kdb>
[0]kdb> bt
    ESP    EIP        Function (args)
0xcdbddf74 0xd087c5dc [scull]scull_read
0xcdbddf78 0xc0150718 vfs_read+0xb8
0xcdbddfa4 0xc01509c2 sys_read+0x42
0xcdbddfc4 0xc0103fcf syscall_call+0x7
[0]kdb>

kdb尝试打印出调用跟踪中每个函数的参数。然而,它会因为编译器使用的优化技巧而感到困惑。因此,它无法打印scull_read的参数。

kdb attempts to print out the arguments to every function in the call trace. It gets confused, however, by optimization tricks used by the compiler. Therefore, it fails to print the arguments to scull_read.

该看一些数据了。mds 命令用于操作数据;我们可以用如下命令查询 scull_devices 指针的值:

Time to look at some data. The mds command manipulates data; we can query the value of the scull_devices pointer with a command such as:

[0]kdb> mds scull_devices 1
0xd0880de8 cf36ac00 ....
[0]kdb> mds scull_devices 1
0xd0880de8 cf36ac00    ....

这里我们要求从 scull_devices 的位置开始读取一个(4 字节)字的数据;结果告诉我们,我们的设备数组位于地址 0xd0880de8;第一个设备结构本身位于 0xcf36ac00。要查看该设备结构,我们需要使用该地址:

Here we asked for one (4-byte) word of data starting at the location of scull_devices; the answer tells us that our device array is at the address 0xd0880de8; the first device structure itself is at 0xcf36ac00. To look at that device structure, we need to use that address:

[0]kdb> mds cf36ac00
0xcf36ac00 ce137dbc ....
0xcf36ac04 00000fa0 ....
0xcf36ac08 000003e8 ....
0xcf36ac0c 0000009b ....
0xcf36ac10 00000000 ....
0xcf36ac14 00000001 ....
0xcf36ac18 00000000 ....
0xcf36ac1c 00000001 ....
[0]kdb> mds cf36ac00
0xcf36ac00 ce137dbc ....
0xcf36ac04 00000fa0 ....
0xcf36ac08 000003e8 ....
0xcf36ac0c 0000009b ....
0xcf36ac10 00000000 ....
0xcf36ac14 00000001 ....
0xcf36ac18 00000000 ....
0xcf36ac1c 00000001 ....

这里的八行对应于 scull_dev 结构的开始部分。因此,我们看到第一个设备的内存分配在 0xce137dbc,量子大小为 4000(十六进制 fa0),量子集大小为 1000(十六进制 3e8),当前设备中存储了 155(十六进制 9b)字节。

The eight lines here correspond to the beginning part of the scull_dev structure. Therefore, we see that the memory for the first device is allocated at 0xce137dbc, the quantum is 4000 (hex fa0), the quantum set size is 1000 (hex 3e8), and there are currently 155 (hex 9b) bytes stored in the device.

kdb也可以更改数据。假设我们想从设备中删除一些数据:

kdb can change data as well. Suppose we wanted to trim some of the data from the device:

[0]kdb> mm cf36ac0c 0x50
0xcf36ac0c = 0x50
[0]kdb> mm cf36ac0c 0x50
0xcf36ac0c = 0x50

随后对该设备执行 cat,返回的数据将比以前少。

A subsequent cat on the device will now return less data than before.

kdb 具有许多其他功能,包括单步执行(按指令,而不是按 C 源代码行)、在数据访问上设置断点、反汇编代码、遍历链表、访问寄存器数据等等。应用 kdb 补丁后,可以在内核源代码树的 Documentation/kdb 目录中找到完整的手册页集。

kdb has a number of other capabilities, including single-stepping (by instructions, not lines of C source code), setting breakpoints on data access, disassembling code, stepping through linked lists, accessing register data, and more. After you have applied the kdb patch, a full set of manual pages can be found in the Documentation/kdb directory in your kernel source tree.

kgdb 补丁

The kgdb Patches

到目前为止,我们看到的两种交互式调试方法(在 /proc/kcore 上使用 gdb,以及 kdb)都达不到用户空间应用程序开发人员所习惯的那种环境。如果有一个真正的内核调试器,支持更改变量、断点等功能,那不是很好吗?

The two interactive debugging approaches we have looked at so far (using gdb on /proc/kcore and kdb) both fall short of the sort of environment that user-space application developers have become used to. Wouldn't it be nice if there were a true debugger for the kernel that supported features like changing variables, breakpoints, etc.?

事实证明,这样的解决方案确实存在。截至撰写本文时,有两个单独的补丁正在流通,它们允许功能完整的 gdb 针对内核运行。令人困惑的是,这两个补丁都称为 kgdb。它们的工作原理是将运行测试内核的系统与运行调试器的系统分开;两者通常通过串行电缆连接。因此,开发人员可以在他或她的稳定桌面系统上运行 gdb,同时对运行在一台可牺牲的测试机上的内核进行操作。在这种模式下设置 gdb 一开始需要一些时间,但是当出现困难的错误时,这种投资可以很快得到回报。

As it turns out, such a solution does exist. There are, as of this writing, two separate patches in circulation that allow gdb, with full capabilities, to be run against the kernel. Confusingly, both of these patches are called kgdb. They work by separating the system running the test kernel from the system running the debugger; the two are typically connected via a serial cable. Therefore, the developer can run gdb on his or her stable desktop system, while operating on a kernel running on a sacrificial test box. Setting up gdb in this mode takes a little time at the outset, but that investment can pay off quickly when a difficult bug shows up.
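On the development-machine side, a session with either patch follows gdb's usual remote-target pattern; a sketch follows, where the serial device name and baud rate are assumptions for a typical setup (consult the patch documentation for the exact procedure):

```
$ gdb ./vmlinux
(gdb) set remotebaud 115200
(gdb) target remote /dev/ttyS0
```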

这些补丁仍处于剧烈的变动之中,甚至可能在某个时候被合并,因此除了它们的位置和基本功能之外,我们避免过多谈论它们。我们鼓励有兴趣的读者自己去了解当前的状况。

These patches are in a strong state of flux, and may even be merged at some point, so we avoid saying much about them beyond where they are and their basic features. Interested readers are encouraged to look and see the current state of affairs.

第一个 kgdb 补丁目前位于 -mm 内核树中——即补丁进入 2.6 主线之前的暂存区域。此版本的补丁支持 x86、SuperH、ia64、x86_64、SPARC 和 32 位 PPC 体系结构。除了通常的通过串行端口操作的模式外,该版本的 kgdb 还可以通过局域网进行通信。只需启用以太网模式,并在启动时设置 kgdboe 参数以指明可以发出调试命令的 IP 地址即可。Documentation/i386/kgdb 下的文档描述了如何进行设置。[ 4 ]

The first kgdb patch is currently found in the -mm kernel tree—the staging area for patches on their way into the 2.6 mainline. This version of the patch supports the x86, SuperH, ia64, x86_64, SPARC, and 32-bit PPC architectures. In addition to the usual mode of operation over a serial port, this version of kgdb can also communicate over a local-area network. It is simply a matter of enabling the Ethernet mode and booting with the kgdboe parameter set to indicate the IP address from which debugging commands can originate. The documentation under Documentation/i386/kgdb describes how to set things up.[4]

作为替代方案,您可以使用http://kgdb.sf.net/上找到的kgdb补丁。此版本的调试器不支持网络通信模式(尽管据说正在开发中),但它确实对使用可加载模块有一些内置支持。它支持 x86、x86_64、PowerPC 和 S/390 体系结构。

As an alternative, you can use the kgdb patch found on http://kgdb.sf.net/. This version of the debugger does not support the network communication mode (though that is said to be under development), but it does have some built-in support for working with loadable modules. It supports the x86, x86_64, PowerPC, and S/390 architectures.

用户模式 Linux 端口

The User-Mode Linux Port

用户模式 Linux (UML) 是一个有趣的概念。它被构造为 Linux 内核的一个单独移植版本,具有自己的 arch/um 子目录。然而,它并不是运行在新型硬件上;相反,它运行在基于 Linux 系统调用接口实现的虚拟机上。因此,UML 允许 Linux 内核作为一个独立的用户模式进程运行在 Linux 系统上。

User-Mode Linux (UML) is an interesting concept. It is structured as a separate port of the Linux kernel with its own arch/um subdirectory. It does not run on a new type of hardware, however; instead, it runs on a virtual machine implemented on the Linux system call interface. Thus, UML allows the Linux kernel to run as a separate, user-mode process on a Linux system.

将内核副本作为用户模式进程运行会带来许多优点。由于它运行在受限的虚拟处理器上,有缺陷的内核不会损坏“真实”系统。可以在同一台机器上轻松尝试不同的硬件和软件配置。而且,也许对内核开发人员来说最重要的是,用户模式内核可以很容易地用 gdb 或其他调试器进行操作。毕竟,它只是另一个进程。UML 显然具有加速内核开发的潜力。

Having a copy of the kernel running as a user-mode process brings a number of advantages. Because it is running on a constrained, virtual processor, a buggy kernel cannot damage the "real" system. Different hardware and software configurations can be tried easily on the same box. And, perhaps most significantly for kernel developers, the user-mode kernel can be easily manipulated with gdb or another debugger. After all, it is just another process. UML clearly has the potential to accelerate kernel development.

然而,从驱动程序编写者的角度来看,UML 有一个很大的缺点:用户模式内核无法访问主机系统的硬件。因此,虽然 UML 对于调试本书中的大多数示例驱动程序很有用,但对于调试必须处理实际硬件的驱动程序还没有什么用处。

However, UML has a big shortcoming from the point of view of driver writers: the user-mode kernel has no access to the host system's hardware. Thus, while it can be useful for debugging most of the sample drivers in this book, UML is not yet useful for debugging drivers that have to deal with real hardware.

有关 UML 的更多信息,请参见http://user-mode-linux.sf.net/ 。

See http://user-mode-linux.sf.net/ for more information on UML.

Linux 跟踪工具包

The Linux Trace Toolkit

Linux Trace Toolkit (LTT) 是一个内核补丁和一组相关实用程序,允许跟踪内核中的事件。该跟踪包括计时信息,并且可以创建给定时间段内发生的情况的相当完整的图片。因此,它不仅可以用于调试,还可以用于跟踪性能问题。

The Linux Trace Toolkit (LTT) is a kernel patch and a set of related utilities that allow the tracing of events in the kernel. The trace includes timing information and can create a reasonably complete picture of what happened over a given period of time. Thus, it can be used not only for debugging but also for tracking down performance problems.

LTT 以及大量文档可以在 http://www.opersys.com/LTT找到

LTT, along with extensive documentation, can be found at http://www.opersys.com/LTT.

动态探针

Dynamic Probes

Dynamic Probes(或 DProbes)是 IBM 针对 IA-32 架构上的 Linux 发布的(在 GPL 下)调试工具。它允许在系统中的几乎任何位置(用户空间和内核空间)放置“探针”。探测器由一些代码(用专门的、面向堆栈的语言编写)组成,这些代码在控制到达给定点时执行。该代码可以将信息报告回用户空间、更改寄存器或执行许多其他操作。DProbes 的有用功能是,一旦将功能内置到内核中,探针就可以插入正在运行的系统中的任何位置,而无需构建内核或重新启动。DProbes 还可以与 LTT 配合使用,在任意位置插入新的跟踪事件。

Dynamic Probes (or DProbes) is a debugging tool released (under the GPL) by IBM for Linux on the IA-32 architecture. It allows the placement of a "probe" at almost any place in the system, in both user and kernel space. The probe consists of some code (written in a specialized, stack-oriented language) that is executed when control hits the given point. This code can report information back to user space, change registers, or do a number of other things. The useful feature of DProbes is that once the capability has been built into the kernel, probes can be inserted anywhere within a running system without kernel builds or reboots. DProbes can also work with the LTT to insert new tracing events at arbitrary locations.

DProbes 工具可以从 IBM 的开源站点下载:http://oss.software.ibm.com。

The DProbes tool can be downloaded from IBM's open source site: http://oss.software.ibm.com.




[ 1 ]例如,使用 setlevel 8; setconsole 10 来设置终端 10 显示消息。

[1] For example, use setlevel 8; setconsole 10 to set up terminal 10 to display messages.

[ 2 ]连字符或减号是一个“神奇”标记,可防止 syslogd 在每条新消息到来时将文件刷新到磁盘;该标记记录在 syslog.conf(5) 中,这是一个值得一读的手册页。

[2] The hyphen, or minus sign, is a "magic" marker to prevent syslogd from flushing the file to disk at every new message, documented in syslog.conf(5), a manpage worth reading.

[ 3 ]实际上,多处理器系统仍然在其他处理器上进行调度,如果启用了内核抢占,即使是单处理器机器也可能会重新调度。然而,对于最常见的情况(禁用抢占的单处理器),系统完全停止调度。

[3] Actually, multiprocessor systems still schedule on the other processors, and even a uniprocessor machine might reschedule if kernel preemption is enabled. For the most common case (uniprocessor with preemption disabled), however, the system stops scheduling altogether.

[ 4 ]然而,它忽略了指出这一点:您应该将网络适配器驱动程序内置到内核中,否则调试器在启动时找不到它,并会自行关闭。

[4] It does neglect to point out that you should have your network adapter driver built into the kernel, however, or the debugger fails to find it at boot time and will shut itself down.

第 5 章并发和竞争条件

Chapter 5. Concurrency and Race Conditions

到目前为止,我们很少关注并发问题,即当系统尝试同时做多件事时会发生什么。然而,并发管理是操作系统编程的核心问题之一。与并发相关的错误是一些最容易创建的错误,也是一些最难发现的错误。即使是专业的 Linux 内核程序员有时也会产生与并发相关的错误。

Thus far, we have paid little attention to the problem of concurrency—i.e., what happens when the system tries to do more than one thing at once. The management of concurrency is, however, one of the core problems in operating systems programming. Concurrency-related bugs are some of the easiest to create and some of the hardest to find. Even expert Linux kernel programmers end up creating concurrency-related bugs on occasion.

在早期的 Linux 内核中,并发源相对较少。内核不支持对称多处理(SMP)系统,并发执行的唯一原因是硬件中断服务。这种方法很简单,但在一个重视处理器数量越来越多的系统性能并要求系统快速响应事件的世界中,它不再适用。为了满足现代硬件和应用程序的需求,Linux 内核已经发展到可以同时处理更多事情的程度。这种演变带来了更高的性能和可扩展性。然而,它也显著加大了内核编程任务的复杂性。设备驱动程序员现在必须从一开始就在设计中考虑并发因素,并且必须深入理解内核为并发管理提供的设施。

In early Linux kernels, there were relatively few sources of concurrency. Symmetric multiprocessing (SMP) systems were not supported by the kernel, and the only cause of concurrent execution was the servicing of hardware interrupts. That approach offers simplicity, but it no longer works in a world that prizes performance on systems with more and more processors, and that insists that the system respond to events quickly. In response to the demands of modern hardware and applications, the Linux kernel has evolved to a point where many more things are going on simultaneously. This evolution has resulted in far greater performance and scalability. It has also, however, significantly complicated the task of kernel programming. Device driver programmers must now factor concurrency into their designs from the beginning, and they must have a strong understanding of the facilities provided by the kernel for concurrency management.

本章的目的是开始建立这种理解的过程。为此,我们介绍的一些设施会立即应用到第 3 章的 scull 驱动程序中;这里介绍的其他设施则要过一段时间才会用到。但首先,我们来看看我们简单的 scull 驱动程序可能会出现什么问题,以及如何避免这些潜在问题。

The purpose of this chapter is to begin the process of creating that understanding. To that end, we introduce facilities that are immediately applied to the scull driver from Chapter 3. Other facilities presented here are not put to use for some time yet. But first, we take a look at what could go wrong with our simple scull driver and how to avoid these potential problems.

scull 中的陷阱

Pitfalls in scull

让我们快速看一下 scull 内存管理代码的一个片段。在写入逻辑的深处,scull 必须判断它所需的内存是否已经分配。处理此任务的一段代码如下:

Let us take a quick look at a fragment of the scull memory management code. Deep down inside the write logic, scull must decide whether the memory it requires has been allocated yet or not. One piece of the code that handles this task is:

    if (!dptr->data[s_pos]) {
        dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL);
        if (!dptr->data[s_pos])
            goto out;
    }
    if (!dptr->data[s_pos]) {
        dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL);
        if (!dptr->data[s_pos])
            goto out;
    }

假设有两个进程(我们称之为“A”和“B”)独立地尝试写入同一个 scull 设备中的相同偏移量。两个进程同时到达上面片段第一行的 if 测试。如果所讨论的指针是 NULL,每个进程都会决定分配内存,并且各自将得到的指针赋给 dptr->data[s_pos]。由于两个进程都对同一位置赋值,显然只有其中一个赋值会保留下来。

Suppose for a moment that two processes (we'll call them "A" and "B") are independently attempting to write to the same offset within the same scull device. Each process reaches the if test in the first line of the fragment above at the same time. If the pointer in question is NULL, each process will decide to allocate memory, and each will assign the resulting pointer to dptr->data[s_pos]. Since both processes are assigning to the same location, clearly only one of the assignments will prevail.

当然,接下来会发生的是,第二个完成任务的进程将“获胜”。如果进程A先分配,它的分配将被进程B覆盖。此时,scull将完全忘记A分配的内存;它只有一个指向B内存的指针。因此,A 分配的内存将被删除,并且永远不会返回给系统。

What will happen, of course, is that the process that completes the assignment second will "win." If process A assigns first, its assignment will be overwritten by process B. At that point, scull will forget entirely about the memory that A allocated; it only has a pointer to B's memory. The memory allocated by A, thus, will be dropped and never returned to the system.

这一系列事件演示了竞争条件(race condition)。竞争条件是对共享数据的不受控制的访问的结果。当出现错误的访问顺序时,就会产生意想不到的结果。对于这里讨论的竞争条件,结果是内存泄漏。这已经够糟糕了,但竞争条件还常常导致系统崩溃、数据损坏或安全问题。程序员可能会倾向于把竞争条件当作概率极低的事件而置之不理。但是,在计算世界中,百万分之一的事件可能每隔几秒就会发生一次,而且后果可能很严重。

This sequence of events is a demonstration of a race condition . Race conditions are a result of uncontrolled access to shared data. When the wrong access pattern happens, something unexpected results. For the race condition discussed here, the result is a memory leak. That is bad enough, but race conditions can often lead to system crashes, corrupted data, or security problems as well. Programmers can be tempted to disregard race conditions as extremely low probability events. But, in the computing world, one-in-a-million events can happen every few seconds, and the consequences can be grave.

我们很快就会消除scull中的竞争条件,但首先我们需要对并发有一个更全面的了解。

We will eliminate race conditions from scull shortly, but first we need to take a more general view of concurrency.

并发及其管理

Concurrency and Its Management

在现代 Linux 系统中,并发源有很多,因此可能存在竞争条件。多个用户空间进程正在运行,它们可能以令人惊讶的组合方式访问您的代码。SMP 系统可以在不同的处理器上同时执行您的代码。内核代码是可抢占的;您的驱动程序代码可能随时失去处理器,而接替它的进程也可能正运行在您的驱动程序中。设备中断是异步事件,可能导致您的代码并发执行。内核还提供了各种延迟代码执行的机制,例如工作队列(workqueue)、tasklet 和定时器,这些机制可以使您的代码在任何时间运行,而与当前进程正在执行的操作无关。在现代的热插拔世界中,您的设备甚至可能在您正在使用它的过程中消失。

In a modern Linux system, there are numerous sources of concurrency and, therefore, possible race conditions. Multiple user-space processes are running, and they can access your code in surprising combinations of ways. SMP systems can be executing your code simultaneously on different processors. Kernel code is preemptible; your driver's code can lose the processor at any time, and the process that replaces it could also be running in your driver. Device interrupts are asynchronous events that can cause concurrent execution of your code. The kernel also provides various mechanisms for delayed code execution, such as workqueues, tasklets, and timers, which can cause your code to run at any time in ways unrelated to what the current process is doing. In the modern, hot-pluggable world, your device could simply disappear while you are in the middle of working with it.

避免竞争条件可能是一项令人生畏的任务。在一个随时可能发生任何事情的世界中,驱动程序程序员如何避免造成绝对的混乱?事实证明,大多数竞争条件都可以通过一些想法、内核的并发控制原语以及一些基本原则的应用来避免。我们将首先从原则开始,然后详细讨论如何应用它们。

Avoidance of race conditions can be an intimidating task. In a world where anything can happen at any time, how does a driver programmer avoid the creation of absolute chaos? As it turns out, most race conditions can be avoided through some thought, the kernel's concurrency control primitives, and the application of a few basic principles. We'll start with the principles first, then get into the specifics of how to apply them.

竞争条件是由于对资源的共享访问而产生的。当两个执行线程[ 1 ]有理由使用相同的数据结构(或硬件资源)时,就始终存在混乱的可能性。因此,在设计驱动程序时要记住的第一条经验法则是尽可能避免共享资源。如果没有并发访问,就不可能存在竞争条件。因此,精心编写的内核代码应该把共享降至最低。这个想法最明显的应用是避免使用全局变量。如果您把一项资源放在多个执行线程都能找到的位置,就应该有充分的理由这样做。

Race conditions come about as a result of shared access to resources. When two threads of execution[1] have a reason to work with the same data structures (or hardware resources), the potential for mixups always exists. So the first rule of thumb to keep in mind as you design your driver is to avoid shared resources whenever possible. If there is no concurrent access, there can be no race conditions. So carefully-written kernel code should have a minimum of sharing. The most obvious application of this idea is to avoid the use of global variables. If you put a resource in a place where more than one thread of execution can find it, there should be a strong reason for doing so.

然而,事实是,这种共享常常是必需的。硬件资源本质上是共享的,而软件资源通常也必须可供多个线程使用。还要记住,全局变量远不是共享数据的唯一方法;任何时候你的代码将指针传递给内核的其他部分,它都可能创建一个新的共享情况。分享是生活的一个事实。

The fact of the matter is, however, that such sharing is often required. Hardware resources are, by their nature, shared, and software resources also must often be available to more than one thread. Bear in mind as well that global variables are far from the only way to share data; any time your code passes a pointer to some other part of the kernel, it is potentially creating a new sharing situation. Sharing is a fact of life.

这是资源共享的硬性规则:任何时候,只要硬件或软件资源在单个执行线程之外被共享,并且存在某个线程可能看到该资源不一致视图的可能性,您就必须显式管理对该资源的访问。在上面的 scull 例子中,进程 B 对情况的看法是不一致的;它不知道进程 A 已经为(共享)设备分配了内存,于是执行自己的分配并覆盖了 A 的工作。在这种情况下,我们必须控制对 scull 数据结构的访问。我们需要对事情做出安排,使代码要么看到已经分配的内存,要么知道没有其他人已经或将要分配内存。通常用于访问管理的技术称为锁定(locking)或互斥(mutual exclusion)——确保任何时候只有一个执行线程可以操作共享资源。本章余下的大部分内容将专门讨论锁定。

Here is the hard rule of resource sharing: any time that a hardware or software resource is shared beyond a single thread of execution, and the possibility exists that one thread could encounter an inconsistent view of that resource, you must explicitly manage access to that resource. In the scull example above, process B's view of the situation is inconsistent; unaware that process A has already allocated memory for the (shared) device, it performs its own allocation and overwrites A's work. In this case, we must control access to the scull data structure. We need to arrange things so that the code either sees memory that has been allocated or knows that no memory has been or will be allocated by anybody else. The usual technique for access management is called locking or mutual exclusion—making sure that only one thread of execution can manipulate a shared resource at any time. Much of the rest of this chapter will be devoted to locking.

First, however, we must briefly consider one other important rule. When kernel code creates an object that will be shared with any other part of the kernel, that object must continue to exist (and function properly) until it is known that no outside references to it exist. The instant that scull makes its devices available, it must be prepared to handle requests on those devices. And scull must continue to be able to handle requests on its devices until it knows that no reference (such as open user-space files) to those devices exists. Two requirements come out of this rule: no object can be made available to the kernel until it is in a state where it can function properly, and references to such objects must be tracked. In most cases, you'll find that the kernel handles reference counting for you, but there are always exceptions.

Following the above rules requires planning and careful attention to detail. It is easy to be surprised by concurrent access to resources you hadn't realized were shared. With some effort, however, most race conditions can be headed off before they bite you—or your users.

Semaphores and Mutexes

So let us look at how we can add locking to scull. Our goal is to make our operations on the scull data structure atomic, meaning that the entire operation happens at once as far as other threads of execution are concerned. For our memory leak example, we need to ensure that if one thread finds that a particular chunk of memory must be allocated, it has the opportunity to perform that allocation before any other thread can make that test. To this end, we must set up critical sections: code that can be executed by only one thread at any given time.

Not all critical sections are the same, so the kernel provides different primitives for different needs. In this case, every access to the scull data structure happens in process context as a result of a direct user request; no accesses will be made from interrupt handlers or other asynchronous contexts. There are no particular latency (response time) requirements; application programmers understand that I/O requests are not usually satisfied immediately. Furthermore, scull is not holding any other critical system resource while it is accessing its own data structures. What all this means is that if the scull driver goes to sleep while waiting for its turn to access the data structure, nobody is going to mind.

"Go to sleep" is a well-defined term in this context. When a Linux process reaches a point where it cannot make any further progress, it goes to sleep (or "blocks"), yielding the processor to somebody else until some future time when it can get work done again. Processes often sleep when waiting for I/O to complete. As we get deeper into the kernel, we will encounter a number of situations where we cannot sleep. The write method in scull is not one of those situations, however. So we can use a locking mechanism that might cause the process to sleep while waiting for access to the critical section.

Just as importantly, we will be performing an operation (memory allocation with kmalloc) that could sleep—so sleeps are a possibility in any case. If our critical sections are to work properly, we must use a locking primitive that works when a thread that owns the lock sleeps. Not all locking mechanisms can be used where sleeping is a possibility (we will see some of them later in this chapter). For our present needs, however, the mechanism that fits best is a semaphore.

Semaphores are a well-understood concept in computer science. At its core, a semaphore is a single integer value combined with a pair of functions that are typically called P and V. A process wishing to enter a critical section will call P on the relevant semaphore; if the semaphore's value is greater than zero, that value is decremented by one and the process continues. If, instead, the semaphore's value is 0 (or less), the process must wait until somebody else releases the semaphore. Unlocking a semaphore is accomplished by calling V; this function increments the value of the semaphore and, if necessary, wakes up processes that are waiting.

When semaphores are used for mutual exclusion—keeping multiple processes from running within a critical section simultaneously—their value will be initially set to 1. Such a semaphore can be held only by a single process or thread at any given time. A semaphore used in this mode is sometimes called a mutex, which is, of course, an abbreviation for "mutual exclusion." Almost all semaphores found in the Linux kernel are used for mutual exclusion.

The Linux Semaphore Implementation

The Linux kernel provides an implementation of semaphores that conforms to the above semantics, although the terminology is a little different. To use semaphores, kernel code must include <asm/semaphore.h>. The relevant type is struct semaphore; actual semaphores can be declared and initialized in a few ways. One is to create a semaphore directly, then set it up with sema_init:

void sema_init(struct semaphore *sem, int val);

where val is the initial value to assign to a semaphore.

Usually, however, semaphores are used in a mutex mode. To make this common case a little easier, the kernel has provided a set of helper functions and macros. Thus, a mutex can be declared and initialized with one of the following:

DECLARE_MUTEX(name);
DECLARE_MUTEX_LOCKED(name);

Here, the result is a semaphore variable (called name) that is initialized to 1 (with DECLARE_MUTEX) or 0 (with DECLARE_MUTEX_LOCKED). In the latter case, the mutex starts out in a locked state; it will have to be explicitly unlocked before any thread will be allowed access.

If the mutex must be initialized at runtime (which is the case if it is allocated dynamically, for example), use one of the following:

void init_MUTEX(struct semaphore *sem);
void init_MUTEX_LOCKED(struct semaphore *sem);

In the Linux world, the P function is called down—or some variation of that name. Here, "down" refers to the fact that the function decrements the value of the semaphore and, perhaps after putting the caller to sleep for a while to wait for the semaphore to become available, grants access to the protected resources. There are three versions of down:

void down(struct semaphore *sem);
int down_interruptible(struct semaphore *sem);
int down_trylock(struct semaphore *sem);

down decrements the value of the semaphore and waits as long as need be. down_interruptible does the same, but the operation is interruptible. The interruptible version is almost always the one you will want; it allows a user-space process that is waiting on a semaphore to be interrupted by the user. You do not, as a general rule, want to use noninterruptible operations unless there truly is no alternative. Noninterruptible operations are a good way to create unkillable processes (the dreaded "D state" seen in ps) and to annoy your users. Using down_interruptible requires some extra care, however: if the operation is interrupted, the function returns a nonzero value, and the caller does not hold the semaphore. Proper use of down_interruptible requires always checking the return value and responding accordingly.

The final version (down_trylock) never sleeps; if the semaphore is not available at the time of the call, down_trylock returns immediately with a nonzero return value.

Once a thread has successfully called one of the versions of down, it is said to be "holding" the semaphore (or to have "taken out" or "acquired" the semaphore). That thread is now entitled to access the critical section protected by the semaphore. When the operations requiring mutual exclusion are complete, the semaphore must be returned. The Linux equivalent to V is up:

void up(struct semaphore *sem);

Once up has been called, the caller no longer holds the semaphore.

As you would expect, any thread that takes out a semaphore is required to release it with one (and only one) call to up. Special care is often required in error paths; if an error is encountered while a semaphore is held, that semaphore must be released before returning the error status to the caller. Failure to free a semaphore is an easy error to make; the result (processes hanging in seemingly unrelated places) can be hard to reproduce and track down.

Using Semaphores in scull

The semaphore mechanism gives scull a tool that can be used to avoid race conditions while accessing the scull_dev data structure. But it is up to us to use that tool correctly. The keys to proper use of locking primitives are to specify exactly which resources are to be protected and to make sure that every access to those resources uses the proper locking. In our example driver, everything of interest is contained within the scull_dev structure, so that is the logical scope for our locking regime.

Let's look again at that structure:

struct scull_dev {
    struct scull_qset *data;  /* Pointer to first quantum set */
    int quantum;              /* the current quantum size */
    int qset;                 /* the current array size */
    unsigned long size;       /* amount of data stored here */
    unsigned int access_key;  /* used by sculluid and scullpriv */
    struct semaphore sem;     /* mutual exclusion semaphore     */
    struct cdev cdev;     /* Char device structure      */
};

Toward the bottom of the structure is a member called sem which is, of course, our semaphore. We have chosen to use a separate semaphore for each virtual scull device. It would have been equally correct to use a single, global semaphore. The various scull devices share no resources in common, however, and there is no reason to make one process wait while another process is working with a different scull device. Using a separate semaphore for each device allows operations on different devices to proceed in parallel and, therefore, improves performance.

Semaphores must be initialized before use. scull performs this initialization at load time in this loop:

    for (i = 0; i < scull_nr_devs; i++) {
        scull_devices[i].quantum = scull_quantum;
        scull_devices[i].qset = scull_qset;
        init_MUTEX(&scull_devices[i].sem);
        scull_setup_cdev(&scull_devices[i], i);
    }

Note that the semaphore must be initialized before the scull device is made available to the rest of the system. Therefore, init_MUTEX is called before scull_setup_cdev. Performing these operations in the opposite order would create a race condition where the semaphore could be accessed before it is ready.

Next, we must go through the code and make sure that no accesses to the scull_dev data structure are made without holding the semaphore. Thus, for example, scull_write begins with this code:

    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;

Note the check on the return value of down_interruptible; if it returns nonzero, the operation was interrupted. The usual thing to do in this situation is to return -ERESTARTSYS. Upon seeing this return code, the higher layers of the kernel will either restart the call from the beginning or return the error to the user. If you return -ERESTARTSYS, you must first undo any user-visible changes that might have been made, so that the right thing happens when the system call is retried. If you cannot undo things in this manner, you should return -EINTR instead.

scull_write must release the semaphore whether or not it was able to carry out its other tasks successfully. If all goes well, execution falls into the final few lines of the function:

out:
  up(&dev->sem);
  return retval;

This code frees the semaphore and returns whatever status is called for. There are several places in scull_write where things can go wrong; these include memory allocation failures or a fault while trying to copy data from user space. In those cases, the code performs a goto out, ensuring that the proper cleanup is done.

Reader/Writer Semaphores

Semaphores perform mutual exclusion for all callers, regardless of what each thread may want to do. Many tasks break down into two distinct types of work, however: tasks that only need to read the protected data structures and those that must make changes. It is often possible to allow multiple concurrent readers, as long as nobody is trying to make any changes. Doing so can optimize performance significantly; read-only tasks can get their work done in parallel without having to wait for other readers to exit the critical section.

The Linux kernel provides a special type of semaphore called a rwsem (or "reader/writer semaphore") for this situation. The use of rwsems in drivers is relatively rare, but they are occasionally useful.

Code using rwsems must include <linux/rwsem.h>. The relevant data type for reader/writer semaphores is struct rw_semaphore; an rwsem must be explicitly initialized at runtime with:

void init_rwsem(struct rw_semaphore *sem);

A newly initialized rwsem is available for the next task (reader or writer) that comes along. The interface for code needing read-only access is:

void down_read(struct rw_semaphore *sem);
int down_read_trylock(struct rw_semaphore *sem);
void up_read(struct rw_semaphore *sem);

A call to down_read provides read-only access to the protected resources, possibly concurrently with other readers. Note that down_read may put the calling process into an uninterruptible sleep. down_read_trylock will not wait if read access is unavailable; it returns nonzero if access was granted, 0 otherwise. Note that the convention for down_read_trylock differs from that of most kernel functions, where success is indicated by a return value of 0. An rwsem obtained with down_read must eventually be freed with up_read.

The interface for writers is similar:

void down_write(struct rw_semaphore *sem);
int down_write_trylock(struct rw_semaphore *sem);
void up_write(struct rw_semaphore *sem);
void downgrade_write(struct rw_semaphore *sem);

down_write, down_write_trylock, and up_write all behave just like their reader counterparts, except, of course, that they provide write access. If you have a situation where a writer lock is needed for a quick change, followed by a longer period of read-only access, you can use downgrade_write to allow other readers in once you have finished making changes.

An rwsem allows either one writer or an unlimited number of readers to hold the semaphore. Writers get priority; as soon as a writer tries to enter the critical section, no readers will be allowed in until all writers have completed their work. This implementation can lead to reader starvation—where readers are denied access for a long time—if you have a large number of writers contending for the semaphore. For this reason, rwsems are best used when write access is required only rarely, and writer access is held for short periods of time.

Completions

A common pattern in kernel programming involves initiating some activity outside of the current thread, then waiting for that activity to complete. This activity can be the creation of a new kernel thread or user-space process, a request to an existing process, or some sort of hardware-based action. In such cases, it can be tempting to use a semaphore for synchronization of the two tasks, with code such as:

struct semaphore sem;

init_MUTEX_LOCKED(&sem);
start_external_task(&sem);
down(&sem);

The external task can then call up(&sem) when its work is done.

As it turns out, semaphores are not the best tool to use in this situation. In normal use, code attempting to lock a semaphore finds that semaphore available almost all the time; if there is significant contention for the semaphore, performance suffers and the locking scheme needs to be reviewed. So semaphores have been heavily optimized for the "available" case. When used to communicate task completion in the way shown above, however, the thread calling down will almost always have to wait; performance will suffer accordingly. Semaphores can also be subject to a (difficult) race condition when used in this way if they are declared as automatic variables. In some cases, the semaphore could vanish before the process calling up is finished with it.

These concerns inspired the addition of the "completion" interface in the 2.4.7 kernel. Completions are a lightweight mechanism with one task: allowing one thread to tell another that the job is done. To use completions, your code must include <linux/completion.h>. A completion can be created with:

DECLARE_COMPLETION(my_completion);

Or, if the completion must be created and initialized dynamically:

struct completion my_completion;
/* ... */
init_completion(&my_completion);

Waiting for the completion is a simple matter of calling:

void wait_for_completion(struct completion *c);

Note that this function performs an uninterruptible wait. If your code calls wait_for_completion and nobody ever completes the task, the result will be an unkillable process.[2]

On the other side, the actual completion event may be signalled by calling one of the following:

void complete(struct completion *c);
void complete_all(struct completion *c);

The two functions behave differently if more than one thread is waiting for the same completion event. complete wakes up only one of the waiting threads while complete_all allows all of them to proceed. In most cases, there is only one waiter, and the two functions will produce an identical result.

A completion is normally a one-shot device; it is used once then discarded. It is possible, however, to reuse completion structures if proper care is taken. If complete_all is not used, a completion structure can be reused without any problems as long as there is no ambiguity about what event is being signalled. If you use complete_all, however, you must reinitialize the completion structure before reusing it. The macro:

INIT_COMPLETION(struct completion c);

can be used to quickly perform this reinitialization.

As an example of how completions may be used, consider the complete module, which is included in the example source. This module defines a device with simple semantics: any process trying to read from the device will wait (using wait_for_completion) until some other process writes to the device. The code which implements this behavior is:

DECLARE_COMPLETION(comp);

ssize_t complete_read (struct file *filp, char __user *buf, size_t count, loff_t *pos)
{
    printk(KERN_DEBUG "process %i (%s) going to sleep\n",
            current->pid, current->comm);
    wait_for_completion(&comp);
    printk(KERN_DEBUG "awoken %i (%s)\n", current->pid, current->comm);
    return 0; /* EOF */
}

ssize_t complete_write (struct file *filp, const char __user *buf, size_t count,
        loff_t *pos)
{
    printk(KERN_DEBUG "process %i (%s) awakening the readers...\n",
            current->pid, current->comm);
    complete(&comp);
    return count; /* succeed, to avoid retrial */
}

It is possible to have multiple processes "reading" from this device at the same time. Each write to the device will cause exactly one read operation to complete, but there is no way to know which one it will be.

A typical use of the completion mechanism is with kernel thread termination at module exit time. In the prototypical case, some of the driver's internal work is performed by a kernel thread in a while (1) loop. When the module is ready to be cleaned up, the exit function tells the thread to exit and then waits for completion. To this end, the kernel includes a specific function to be used by the thread:

void complete_and_exit(struct completion *c, long retval);

Spinlocks

Semaphores are a useful tool for mutual exclusion, but they are not the only such tool provided by the kernel. Instead, most locking is implemented with a mechanism called a spinlock. Unlike semaphores, spinlocks may be used in code that cannot sleep, such as interrupt handlers. When properly used, spinlocks offer higher performance than semaphores in general. They do, however, bring a different set of constraints on their use.

Spinlocks are simple in concept. A spinlock is a mutual exclusion device that can have only two values: "locked" and "unlocked." It is usually implemented as a single bit in an integer value. Code wishing to take out a particular lock tests the relevant bit. If the lock is available, the "locked" bit is set and the code continues into the critical section. If, instead, the lock has been taken by somebody else, the code goes into a tight loop where it repeatedly checks the lock until it becomes available. This loop is the "spin" part of a spinlock.

Of course, the real implementation of a spinlock is a bit more complex than the description above. The "test and set" operation must be done in an atomic manner so that only one thread can obtain the lock, even if several are spinning at any given time. Care must also be taken to avoid deadlocks on hyperthreaded processors—chips that implement multiple, virtual CPUs sharing a single processor core and cache. So the actual spinlock implementation is different for every architecture that Linux supports. The core concept is the same on all systems, however: when there is contention for a spinlock, the processors that are waiting execute a tight loop and accomplish no useful work.

Spinlocks are, by their nature, intended for use on multiprocessor systems, although a uniprocessor workstation running a preemptive kernel behaves like SMP, as far as concurrency is concerned. If a nonpreemptive uniprocessor system ever went into a spin on a lock, it would spin forever; no other thread would ever be able to obtain the CPU to release the lock. For this reason, spinlock operations on uniprocessor systems without preemption enabled are optimized to do nothing, with the exception of the ones that change the IRQ masking status. Because of preemption, even if you never expect your code to run on an SMP system, you still need to implement proper locking.

Introduction to the Spinlock API

The required include file for the spinlock primitives is <linux/spinlock.h>. An actual lock has the type spinlock_t. Like any other data structure, a spinlock must be initialized. This initialization may be done at compile time as follows:

spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

or at runtime with:

void spin_lock_init(spinlock_t *lock);

Before entering a critical section, your code must obtain the requisite lock with:

void spin_lock(spinlock_t *lock);

Note that all spinlock waits are, by their nature, uninterruptible. Once you call spin_lock, you will spin until the lock becomes available.

To release a lock that you have obtained, pass it to:

void spin_unlock(spinlock_t *lock);

There are many other spinlock functions, and we will look at them all shortly. But none of them depart from the core idea shown by the functions listed above. There is very little that one can do with a lock, other than lock and release it. However, there are a few rules about how you must work with spinlocks. We will take a moment to look at those before getting into the full spinlock interface.

Spinlocks and Atomic Context

想象一下你的驱动程序获取一个自旋锁并在其关键部分中处理其业务。在中间的某个地方,您的驱动程序失去了处理器。也许它调用了一个使进程进入睡眠状态的函数(例如copy_from_user )。或者,也许内核抢占启动,并且更高优先级的进程将您的代码推到一边。您的代码现在持有一个锁,在可预见的将来任何时候都不会释放。如果其他线程尝试获取相同的锁,在最好的情况下,它将等待(在处理器中旋转)很长时间。在最坏的情况下,系统可能完全陷入僵局。

Imagine for a moment that your driver acquires a spinlock and goes about its business within its critical section. Somewhere in the middle, your driver loses the processor. Perhaps it has called a function (copy_from_user, say) that puts the process to sleep. Or, perhaps, kernel preemption kicks in, and a higher-priority process pushes your code aside. Your code is now holding a lock that it will not release any time in the foreseeable future. If some other thread tries to obtain the same lock, it will, in the best case, wait (spinning in the processor) for a very long time. In the worst case, the system could deadlock entirely.

大多数读者都会同意最好避免这种情况。因此,适用于自旋锁的核心规则是任何代码在持有自旋锁时都必须是原子的。它无法入睡;事实上,除了服务中断(有时甚至不这样做)之外,它不能以任何原因放弃处理器。

Most readers would agree that this scenario is best avoided. Therefore, the core rule that applies to spinlocks is that any code must, while holding a spinlock, be atomic. It cannot sleep; in fact, it cannot relinquish the processor for any reason except to service interrupts (and sometimes not even then).

内核抢占情况由自旋锁代码本身处理。只要内核代码持有自旋锁,相关处理器上的抢占就会被禁用。即使是单处理器系统也必须以这种方式禁用抢占,以避免竞争条件。这就是为什么即使您从未期望代码在多处理器计算机上运行,也需要适当的锁定。

The kernel preemption case is handled by the spinlock code itself. Any time kernel code holds a spinlock, preemption is disabled on the relevant processor. Even uniprocessor systems must disable preemption in this way to avoid race conditions. That is why proper locking is required even if you never expect your code to run on a multiprocessor machine.

在持有锁时避免睡眠可能更加困难;许多内核函数可以休眠,而且这种行为并不总是有很好的文档记录。将数据复制到用户空间或从用户空间复制数据是一个明显的例子:在复制继续之前,可能需要从磁盘换入所需的用户空间页面,而该操作显然需要睡眠。几乎任何必须分配内存的操作都可以休眠;除非明确告知不要这样做,kmalloc可能决定放弃处理器,等待更多内存可用。睡眠可能发生在令人意外的地方;编写将在自旋锁保护下执行的代码时,需要注意您调用的每一个函数。

Avoiding sleep while holding a lock can be more difficult; many kernel functions can sleep, and this behavior is not always well documented. Copying data to or from user space is an obvious example: the required user-space page may need to be swapped in from the disk before the copy can proceed, and that operation clearly requires a sleep. Just about any operation that must allocate memory can sleep; kmalloc can decide to give up the processor, and wait for more memory to become available unless it is explicitly told not to. Sleeps can happen in surprising places; writing code that will execute under a spinlock requires paying attention to every function that you call.

这是另一种情况:您的驱动程序正在执行,并且刚刚取出了控制对其设备的访问的锁。保持锁定时,设备会发出中断,这会导致中断处理程序运行。中断处理程序在访问设备之前也必须获得锁。在中断处理程序中取出自旋锁是合法的事情;这是自旋锁操作不休眠的原因之一。但是,如果中断例程与最初取出锁的代码在同一处理器中执行,会发生什么情况?当中断处理程序正在旋转时,非中断代码将无法运行来释放锁。该处理器将永远旋转。

Here's another scenario: your driver is executing and has just taken out a lock that controls access to its device. While the lock is held, the device issues an interrupt, which causes your interrupt handler to run. The interrupt handler, before accessing the device, must also obtain the lock. Taking out a spinlock in an interrupt handler is a legitimate thing to do; that is one of the reasons that spinlock operations do not sleep. But what happens if the interrupt routine executes in the same processor as the code that took out the lock originally? While the interrupt handler is spinning, the noninterrupt code will not be able to run to release the lock. That processor will spin forever.

避免此陷阱需要在保持自旋锁时禁用中断(仅在本地 CPU 上)。自旋锁函数的一些变体可以为您禁用中断(我们将在下一节中看到它们)。然而,对中断的完整讨论必须等到第10章

Avoiding this trap requires disabling interrupts (on the local CPU only) while the spinlock is held. There are variants of the spinlock functions that will disable interrupts for you (we'll see them in the next section). However, a complete discussion of interrupts must wait until Chapter 10.

自旋锁使用的最后一条重要规则是:自旋锁的持有时间必须尽可能短。持有锁的时间越长,其他处理器可能不得不自旋等待您释放它的时间就越长,而且它必须自旋等待的可能性也越大。较长的锁持有时间还会阻止当前处理器进行调度,这意味着本应得到 CPU 的更高优先级进程可能不得不等待。在 2.5 开发系列中,内核开发人员投入了大量精力来减少内核延迟(进程等待调度可能花费的时间)。编写得不好的驱动程序可能仅仅因为持有锁太久就毁掉所有这些进展。为了避免产生此类问题,请务必让锁的持有时间尽量短。

The last important rule for spinlock usage is that spinlocks must always be held for the minimum time possible. The longer you hold a lock, the longer another processor may have to spin waiting for you to release it, and the chance of it having to spin at all is greater. Long lock hold times also keep the current processor from scheduling, meaning that a higher priority process—which really should be able to get the CPU—may have to wait. The kernel developers put a great deal of effort into reducing kernel latency (the time a process may have to wait to be scheduled) in the 2.5 development series. A poorly written driver can wipe out all that progress just by holding a lock for too long. To avoid creating this sort of problem, make a point of keeping your lock-hold times short.

自旋锁函数

The Spinlock Functions

我们已经看到了两个操作自旋锁的函数:spin_lock和spin_unlock。然而,还有其他几个名称和用途相似的函数。现在我们将展示完整的一组函数。这个讨论会涉及一些要到后面几章才能完整讲解的内容;要完全理解自旋锁 API,需要先了解中断处理及相关概念。

We have already seen two functions, spin_lock and spin_unlock, that manipulate spinlocks. There are several other functions, however, with similar names and purposes. We will now present the full set. This discussion will take us into ground we will not be able to cover properly for a few chapters yet; a complete understanding of the spinlock API requires an understanding of interrupt handling and related concepts.

实际上有四个函数可以锁定自旋锁:

There are actually four functions that can lock a spinlock:

void spin_lock(spinlock_t *lock);
void spin_lock_irqsave(spinlock_t *lock, unsigned long flags);
void spin_lock_irq(spinlock_t *lock);
void spin_lock_bh(spinlock_t *lock)

我们已经了解了spin_lock的工作原理。spin_lock_irqsave在获取自旋锁之前禁用中断(仅在本地处理器上);先前的中断状态保存在flags中。如果您绝对确定没有其他代码可能已经在您的处理器上禁用了中断(换句话说,您确定在释放自旋锁时应该启用中断),则可以改用spin_lock_irq,而不必跟踪flags。最后,spin_lock_bh在获取锁之前禁用软件中断,但保持硬件中断启用。

We have already seen how spin_lock works. spin_lock_irqsave disables interrupts (on the local processor only) before taking the spinlock; the previous interrupt state is stored in flags. If you are absolutely sure nothing else might have already disabled interrupts on your processor (or, in other words, you are sure that you should enable interrupts when you release your spinlock), you can use spin_lock_irq instead and not have to keep track of the flags. Finally, spin_lock_bh disables software interrupts before taking the lock, but leaves hardware interrupts enabled.

如果您的自旋锁可能被在(硬件或软件)中断上下文中运行的代码获取,则必须使用某种禁用中断的spin_lock形式。否则,系统迟早会死锁。如果您不在硬件中断处理程序中访问锁,而只通过软件中断访问(例如在 tasklet 中运行的代码,这是第 7 章介绍的主题),则可以使用spin_lock_bh来安全地避免死锁,同时仍然允许硬件中断得到服务。

If you have a spinlock that can be taken by code that runs in (hardware or software) interrupt context, you must use one of the forms of spin_lock that disables interrupts. Doing otherwise can deadlock the system, sooner or later. If you do not access your lock in a hardware interrupt handler, but you do via software interrupts (in code that runs out of a tasklet, for example, a topic covered in Chapter 7), you can use spin_lock_bh to safely avoid deadlocks while still allowing hardware interrupts to be serviced.

还有四种释放自旋锁的方法;您使用的函数必须与获取锁时使用的函数相对应:

There are also four ways to release a spinlock; the one you use must correspond to the function you used to take the lock:

void spin_unlock(spinlock_t *lock);
void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags);
void spin_unlock_irq(spinlock_t *lock);
void spin_unlock_bh(spinlock_t *lock);

每个spin_unlock变体都会撤销对应的spin_lock函数所做的工作。传递给spin_unlock_irqrestore的flags参数必须与传递给spin_lock_irqsave的是同一个变量。您还必须在同一个函数中调用spin_lock_irqsave和spin_unlock_irqrestore;否则,您的代码可能会在某些架构上出错。

Each spin_unlock variant undoes the work performed by the corresponding spin_lock function. The flags argument passed to spin_unlock_irqrestore must be the same variable passed to spin_lock_irqsave. You must also call spin_lock_irqsave and spin_unlock_irqrestore in the same function; otherwise, your code may break on some architectures.

还有一组非阻塞的自旋锁操作:

There is also a set of nonblocking spinlock operations:

int spin_trylock(spinlock_t *lock);
int spin_trylock_bh(spinlock_t *lock);

这些函数在成功时(获得了锁)返回非零值,否则返回0。没有禁用中断的"尝试"版本。

These functions return nonzero on success (the lock was obtained), 0 otherwise. There is no "try" version that disables interrupts.

读取器/写入器自旋锁

Reader/Writer Spinlocks

内核提供了自旋锁的读取器/写入器形式,与我们在本章前面看到的读取器/写入器信号量直接类似。这些锁允许任意数量的读取者同时进入临界区,但写入者必须拥有独占访问权。读取器/写入器锁的类型为rwlock_t,在<linux/spinlock.h>中定义。它们可以通过两种方式声明和初始化:

The kernel provides a reader/writer form of spinlocks that is directly analogous to the reader/writer semaphores we saw earlier in this chapter. These locks allow any number of readers into a critical section simultaneously, but writers must have exclusive access. Reader/writer locks have a type of rwlock_t, defined in <linux/spinlock.h>. They can be declared and initialized in two ways:

rwlock_t my_rwlock = RW_LOCK_UNLOCKED; /* Static way */

rwlock_t my_rwlock;
rwlock_init(&my_rwlock);  /* Dynamic way */

现在可用的函数列表看起来应该相当熟悉。对于读者来说,可以使用以下功能:

The list of functions available should look reasonably familiar by now. For readers, the following functions are available:

void read_lock(rwlock_t *lock);
void read_lock_irqsave(rwlock_t *lock, unsigned long flags);
void read_lock_irq(rwlock_t *lock);
void read_lock_bh(rwlock_t *lock);

void read_unlock(rwlock_t *lock);
void read_unlock_irqrestore(rwlock_t *lock, unsigned long flags);
void read_unlock_irq(rwlock_t *lock);
void read_unlock_bh(rwlock_t *lock);

有趣的是,没有read_trylock

Interestingly, there is no read_trylock.

写访问的函数类似:

The functions for write access are similar:

void write_lock(rwlock_t *lock);
void write_lock_irqsave(rwlock_t *lock, unsigned long flags);
void write_lock_irq(rwlock_t *lock);
void write_lock_bh(rwlock_t *lock);
int write_trylock(rwlock_t *lock);

void write_unlock(rwlock_t *lock);
void write_unlock_irqrestore(rwlock_t *lock, unsigned long flags);
void write_unlock_irq(rwlock_t *lock);
void write_unlock_bh(rwlock_t *lock);

与 rwsem 一样,读取器/写入器锁也可能导致读取者挨饿。这种行为很少会成为问题;不过,如果锁争用严重到足以造成饥饿,那么性能本来就已经很差了。

Reader/writer locks can starve readers just as rwsems can. This behavior is rarely a problem; however, if there is enough lock contention to bring about starvation, performance is poor anyway.

锁定陷阱

Locking Traps

多年的锁使用经验(早在 Linux 之前)表明,正确的锁定非常困难。管理并发本质上是一项棘手的任务,犯错的方式有很多。在本节中,我们快速浏览一下可能出错的地方。

Many years of experience with locks—experience that predates Linux—have shown that locking can be very hard to get right. Managing concurrency is an inherently tricky undertaking, and there are many ways of making mistakes. In this section, we take a quick look at things that can go wrong.

模糊的规则

Ambiguous Rules

正如前面所说,适当的锁定方案需要清晰、明确的规则。当您创建一个可以并发访问的资源时,您应该定义由哪个锁来控制该访问。锁定确实应该在一开始就设计好;事后再改造可能非常困难。一开始投入的时间,通常会在调试时获得丰厚回报。

As has already been said above, a proper locking scheme requires clear and explicit rules. When you create a resource that can be accessed concurrently, you should define which lock will control that access. Locking should really be laid out at the beginning; it can be a hard thing to retrofit in afterward. Time taken at the outset usually is paid back generously at debugging time.

当您编写代码时,无疑会遇到多个函数都需要访问受特定锁保护的结构的情况。此时,您必须小心:如果一个函数获取了锁,然后调用另一个也尝试获取该锁的函数,您的代码就会死锁。信号量和自旋锁都不允许锁的持有者第二次获取锁;如果您尝试这样做,程序就会挂起。

As you write your code, you will doubtless encounter several functions that all require access to structures protected by a specific lock. At this point, you must be careful: if one function acquires a lock and then calls another function that also attempts to acquire the lock, your code deadlocks. Neither semaphores nor spinlocks allow a lock holder to acquire the lock a second time; should you attempt to do so, things simply hang.

为了使锁定正常工作,您必须编写一些函数,并假设它们的调用者已经获取了相关的锁。通常,只有内部静态函数可以用这种方式编写;从外部调用的函数必须显式处理锁定。当您编写对锁定做出假设的内部函数时,请帮您自己(以及使用您的代码的任何其他人)一个忙,并明确记录这些假设。几个月后回过头来弄清楚是否需要持有锁来调用特定函数可能非常困难。

To make your locking work properly, you have to write some functions with the assumption that their caller has already acquired the relevant lock(s). Usually, only your internal, static functions can be written in this way; functions called from outside must handle locking explicitly. When you write internal functions that make assumptions about locking, do yourself (and anybody else who works with your code) a favor and document those assumptions explicitly. It can be very hard to come back months later and figure out whether you need to hold a lock to call a particular function or not.

就scull而言,所做的设计决策是:要求所有直接由系统调用调用的函数获取作用于所访问设备结构的信号量。所有仅由其他scull函数调用的内部函数则可以假设信号量已被正确获取。

In the case of scull, the design decision taken was to require all functions invoked directly from system calls to acquire the semaphore applying to the device structure that is accessed. All internal functions, which are only called from other scull functions, can then assume that the semaphore has been properly acquired.

锁排序规则

Lock Ordering Rules

在锁数量很多的系统中(内核正在成为这样一个系统),代码需要同时持有多个锁的情况并不罕见。如果某种计算必须使用两个不同的资源,而每个资源都有自己的锁,那么通常别无选择,只能同时获取这两个锁。

In systems with a large number of locks (and the kernel is becoming such a system), it is not unusual for code to need to hold more than one lock at once. If some sort of computation must be performed using two different resources, each of which has its own lock, there is often no alternative to acquiring both locks.

然而,获取多个锁可能很危险。如果您有两个锁(称为Lock1和Lock2),而代码需要同时获取这两个锁,就可能出现死锁。想象一下,一个线程锁定了Lock1,而另一个线程同时获取了Lock2,然后每个线程都去尝试获取自己还没有的那个锁。两个线程都会死锁。

Taking multiple locks can be dangerous, however. If you have two locks, called Lock1 and Lock2, and code needs to acquire both at the same time, you have a potential deadlock. Just imagine one thread locking Lock1 while another simultaneously takes Lock2. Then each thread tries to get the one it doesn't have. Both threads will deadlock.

此问题的解决方案通常很简单:当必须获取多个锁时,应始终以相同的顺序获取它们。只要遵循这一约定,就可以避免像上述那样的简单死锁。然而,遵循锁排序规则说起来容易做起来难。很少有地方真正写下这样的规则。通常,您能做的最好的事情就是看看其他代码做了什么。

The solution to this problem is usually simple: when multiple locks must be acquired, they should always be acquired in the same order. As long as this convention is followed, simple deadlocks like the one described above can be avoided. However, following lock ordering rules can be easier said than done. It is very rare that such rules are actually written down anywhere. Often the best you can do is to see what other code does.

一些经验法则可以提供帮助。如果您必须获取代码本地的锁(例如设备锁)以及属于内核更核心部分的锁,请首先获取您的锁。如果您有信号量和自旋锁的组合,那么您当然必须首先获取信号量;在持有自旋锁的情况下调用down(可以休眠)是一个严重的错误。但最重要的是,尽量避免需要多个锁的情况。

A couple of rules of thumb can help. If you must obtain a lock that is local to your code (a device lock, say) along with a lock belonging to a more central part of the kernel, take your lock first. If you have a combination of semaphores and spinlocks, you must, of course, obtain the semaphore(s) first; calling down (which can sleep) while holding a spinlock is a serious error. But most of all, try to avoid situations where you need more than one lock.

细粒度锁定与粗粒度锁定

Fine- Versus Coarse-Grained Locking

第一个支持多处理器系统的 Linux 内核是 2.0;它只包含一个自旋锁。这个大内核锁把整个内核变成了一个大临界区;任何时刻只有一个 CPU 可以执行内核代码。该锁很好地解决了并发问题,使内核开发人员得以处理支持 SMP 所涉及的所有其他问题。但它的扩展性不佳。即使是双处理器系统,也可能花费大量时间等待大内核锁。四处理器系统的性能甚至远达不到四台独立机器的水平。

The first Linux kernel that supported multiprocessor systems was 2.0; it contained exactly one spinlock. The big kernel lock turned the entire kernel into one large critical section; only one CPU could be executing kernel code at any given time. This lock solved the concurrency problem well enough to allow the kernel developers to address all of the other issues involved in supporting SMP. But it did not scale very well. Even a two-processor system could spend a significant amount of time simply waiting for the big kernel lock. The performance of a four-processor system was not even close to that of four independent machines.

因此,后续的内核版本包含了更细粒度的锁定。在 2.2 中,一个自旋锁控制对块 I/O 子系统的访问;另一个用于网络,等等。现代内核可以包含数千个锁,每个锁保护一小部分资源。这种细粒度的锁定有利于可扩展性;它允许每个处理器处理自己的特定任务,而无需争用其他处理器使用的锁。很少有人会怀念大内核锁。[3]

So, subsequent kernel releases have included finer-grained locking. In 2.2, one spinlock controlled access to the block I/O subsystem; another worked for networking, and so on. A modern kernel can contain thousands of locks, each protecting one small resource. This sort of fine-grained locking can be good for scalability; it allows each processor to work on its specific task without contending for locks used by other processors. Very few people miss the big kernel lock.[3]

然而,细粒度锁定是有代价的。在具有数千个锁的内核中,很难知道您需要哪些锁以及应该以什么顺序获取它们来执行特定操作。请记住,锁定错误可能很难找到;更多的锁为真正令人讨厌的锁定错误侵入内核提供了更多的机会。细粒度锁定会带来一定程度的复杂性,从长远来看,可能会对内核的可维护性产生巨大的不利影响。

Fine-grained locking comes at a cost, however. In a kernel with thousands of locks, it can be very hard to know which locks you need—and in which order you should acquire them—to perform a specific operation. Remember that locking bugs can be very difficult to find; more locks provide more opportunities for truly nasty locking bugs to creep into the kernel. Fine-grained locking can bring a level of complexity that, over the long term, can have a large, adverse effect on the maintainability of the kernel.

设备驱动程序中的锁定通常相对简单;您可以用一把锁覆盖您所做的一切,也可以为您管理的每台设备创建一把锁。作为一般规则,除非您有真正的理由相信争用可能成为问题,否则应该从相对粗粒度的锁定开始。抵制过早优化的冲动;真正的性能瓶颈常常出现在意想不到的地方。

Locking in a device driver is usually relatively straightforward; you can have a single lock that covers everything you do, or you can create one lock for every device you manage. As a general rule, you should start with relatively coarse locking unless you have a real reason to believe that contention could be a problem. Resist the urge to optimize prematurely; the real performance constraints often show up in unexpected places.

如果您确实怀疑锁争用正在损害性能,您可能会发现 lockmeter 工具很有用。该补丁(可从http://oss.sgi.com/projects/lockmeter/获取)对内核进行插桩,以测量在锁上等待所花费的时间。通过查看报告,您可以快速确定锁争用到底是不是问题所在。

If you do suspect that lock contention is hurting performance, you may find the lockmeter tool useful. This patch (available at http://oss.sgi.com/projects/lockmeter/) instruments the kernel to measure time spent waiting in locks. By looking at the report, you are able to determine quickly whether lock contention is truly the problem or not.

锁定的替代方案

Alternatives to Locking

Linux 内核提供了许多强大的锁定原语,可以防止内核自乱阵脚。但是,正如我们所看到的,锁定方案的设计和实现并非没有陷阱。通常,除了信号量和自旋锁别无选择;它们可能是正确完成工作的唯一方法。然而,在某些情况下,可以在不需要完整锁定的情况下实现原子访问。本节着眼于其他的做事方式。

The Linux kernel provides a number of powerful locking primitives that can be used to keep the kernel from tripping over its own feet. But, as we have seen, the design and implementation of a locking scheme is not without its pitfalls. Often there is no alternative to semaphores and spinlocks; they may be the only way to get the job done properly. There are situations, however, where atomic access can be set up without the need for full locking. This section looks at other ways of doing things.

无锁算法

Lock-Free Algorithms

有时,您可以重新设计算法,完全避免锁定。许多读取器/写入器情形(如果只有一个写入者)通常可以这样处理。如果写入者小心地保证读取者所看到的数据结构视图始终一致,就有可能创建无锁数据结构。

Sometimes, you can recast your algorithms to avoid the need for locking altogether. A number of reader/writer situations—if there is only one writer—can often work in this manner. If the writer takes care that the view of the data structure, as seen by the reader, is always consistent, it may be possible to create a lock-free data structure.

对于无锁生产者/消费者任务,一种通常很有用的数据结构是循环缓冲区。该算法让生产者将数据放入数组的一端,而消费者从另一端取走数据。当到达数组末尾时,生产者回绕到开头。因此,循环缓冲区需要一个数组和两个索引值,分别跟踪下一个新值放到哪里,以及接下来应从缓冲区中取出哪个值。

A data structure that can often be useful for lockless producer/consumer tasks is the circular buffer . This algorithm involves a producer placing data into one end of an array, while the consumer removes data from the other. When the end of the array is reached, the producer wraps back around to the beginning. So a circular buffer requires an array and two index values to track where the next new value goes and which value should be removed from the buffer next.

如果仔细实现,循环缓冲区在没有多个生产者或消费者的情况下不需要锁定。生产者是唯一允许修改写入索引及其指向的数组位置的线程。只要写入器在更新写入索引之前将新值存储到缓冲区中,读取器将始终看到一致的视图。反过来,读取器是唯一可以访问读取索引及其指向的值的线程。只要小心确保两个指针不会互相溢出,生产者和消费者就可以同时访问缓冲区,而不会出现竞争条件。

When carefully implemented, a circular buffer requires no locking in the absence of multiple producers or consumers. The producer is the only thread that is allowed to modify the write index and the array location it points to. As long as the writer stores a new value into the buffer before updating the write index, the reader will always see a consistent view. The reader, in turn, is the only thread that can access the read index and the value it points to. With a bit of care to ensure that the two pointers do not overrun each other, the producer and the consumer can access the buffer concurrently with no race conditions.

图 5-1显示了处于几种填充状态的循环缓冲区。该缓冲区的定义是:读指针与写指针相等表示空,而写指针紧跟在读指针之后(注意考虑回绕!)表示满。经过仔细编程,该缓冲区可以在没有锁的情况下使用。

Figure 5-1 shows circular buffer in several states of fill. This buffer has been defined such that an empty condition is indicated by the read and write pointers being equal, while a full condition happens whenever the write pointer is immediately behind the read pointer (being careful to account for a wrap!). When carefully programmed, this buffer can be used without locks.


图 5-1。循环缓冲区

Figure 5-1. A circular buffer

循环缓冲区经常出现在设备驱动程序中。特别是网络适配器经常使用循环缓冲区与处理器交换数据(数据包)。请注意,从 2.6.10 开始,内核中有一个通用的循环缓冲区实现可用;有关如何使用它的信息,请参阅<linux/kfifo.h> 。

Circular buffers show up reasonably often in device drivers. Networking adaptors, in particular, often use circular buffers to exchange data (packets) with the processor. Note that, as of 2.6.10, there is a generic circular buffer implementation available in the kernel; see <linux/kfifo.h> for information on how to use it.

原子变量

Atomic Variables

有时,共享资源只是一个简单的整数值。假设您的驱动程序维护一个共享变量n_op,用来记录当前有多少设备操作尚未完成。通常,即使是像下面这样简单的操作:

Sometimes, a shared resource is a simple integer value. Suppose your driver maintains a shared variable n_op that tells how many device operations are currently outstanding. Normally, even a simple operation such as:

n_op++;

也需要锁定。某些处理器可能会以原子方式执行这种递增,但您不能指望它。不过,为了一个简单的整数值而使用完整的锁定机制似乎开销过大。对于这种情况,内核提供了一种名为atomic_t的原子整数类型,定义在<asm/atomic.h>中。

would require locking. Some processors might perform that sort of increment in an atomic manner, but you can't count on it. But a full locking regime seems like overhead for a simple integer value. For cases like this, the kernel provides an atomic integer type called atomic_t, defined in <asm/atomic.h>.

atomic_t在所有受支持的体系结构上都保存一个int值。然而,由于这种类型在某些处理器上的工作方式,完整的整数范围可能不可用;因此,您不应指望atomic_t保存超过 24 位。以下操作是针对该类型定义的,并且保证对于 SMP 计算机的所有处理器而言是原子的。这些操作非常快,因为只要可能,它们就会被编译为单条机器指令。

An atomic_t holds an int value on all supported architectures. Because of the way this type works on some processors, however, the full integer range may not be available; thus, you should not count on an atomic_t holding more than 24 bits. The following operations are defined for the type and are guaranteed to be atomic with respect to all processors of an SMP computer. The operations are very fast, because they compile to a single machine instruction whenever possible.

void atomic_set(atomic_t *v, int i);

atomic_t v = ATOMIC_INIT(0);

将原子变量v设置为整数值i。您还可以在编译时使用ATOMIC_INIT宏初始化原子值。

Set the atomic variable v to the integer value i. You can also initialize atomic values at compile time with the ATOMIC_INIT macro.

int atomic_read(atomic_t *v);

返回v的当前值。

Return the current value of v.

void atomic_add(int i, atomic_t *v);

将i加到v指向的原子变量上。返回值为void,因为返回新值需要额外的开销,而且大多数时候并不需要知道它。

Add i to the atomic variable pointed to by v. The return value is void, because there is an extra cost to returning the new value, and most of the time there's no need to know it.

void atomic_sub(int i, atomic_t *v);

从*v中减去i。

Subtract i from *v.

void atomic_inc(atomic_t *v);

void atomic_dec(atomic_t *v);

递增或递减原子变量。

Increment or decrement an atomic variable.

int atomic_inc_and_test(atomic_t *v);

int atomic_dec_and_test(atomic_t *v);

int atomic_sub_and_test(int i, atomic_t *v);

执行指定的操作并测试结果;如果操作后原子值为0,则返回值为真;否则为假。请注意,没有atomic_add_and_test。

Perform the specified operation and test the result; if, after the operation, the atomic value is 0, then the return value is true; otherwise, it is false. Note that there is no atomic_add_and_test.

int atomic_add_negative(int i, atomic_t *v);

将整数i加到v上。如果结果为负,则返回值为真,否则为假。

Add the integer variable i to v. The return value is true if the result is negative, false otherwise.

int atomic_add_return(int i, atomic_t *v);

int atomic_sub_return(int i, atomic_t *v);

int atomic_inc_return(atomic_t *v);

int atomic_dec_return(atomic_t *v);

行为与atomic_add及其同类函数相同,不同之处在于它们将原子变量的新值返回给调用者。

Behave just like atomic_add and friends, with the exception that they return the new value of the atomic variable to the caller.

如前所述,atomic_t数据项只能通过这些函数访问。如果将原子项传递给需要整数参数的函数,则会出现编译器错误。

As stated earlier, atomic_t data items must be accessed only through these functions. If you pass an atomic item to a function that expects an integer argument, you'll get a compiler error.

您还应该记住,只有当所讨论的量确实是单个原子值时,atomic_t才有效。需要多个atomic_t变量的操作仍然需要某种其他类型的锁定。考虑以下代码:

You should also bear in mind that atomic_t values work only when the quantity in question is truly atomic. Operations requiring multiple atomic_t variables still require some other sort of locking. Consider the following code:

atomic_sub(amount, &first_atomic);
atomic_add(amount, &second_atomic);

存在一段时间:amount已从第一个原子值中减去,但尚未加到第二个原子值上。如果这种状态可能给在这两个操作之间运行的代码带来麻烦,就必须采用某种形式的锁定。

There is a period of time where the amount has been subtracted from the first atomic value but not yet added to the second. If that state of affairs could create trouble for code that might run between the two operations, some form of locking must be employed.

位运算

Bit Operations

atomic_t类型适合执行整数算术。然而,当您需要以原子方式操作单个位时,它就不那么适用了。为此,内核提供了一组以原子方式修改或测试单个位的函数。因为整个操作在单一步骤中完成,所以不会有中断(或其他处理器)进行干扰。

The atomic_t type is good for performing integer arithmetic. It doesn't work as well, however, when you need to manipulate individual bits in an atomic manner. For that purpose, instead, the kernel offers a set of functions that modify or test single bits atomically. Because the whole operation happens in a single step, no interrupt (or other processor) can interfere.

原子位操作非常快,因为只要底层平台支持,它们就用单条机器指令执行操作,而无需禁用中断。这些函数与体系结构相关,在<asm/bitops.h>中声明。即使在 SMP 计算机上,它们也保证是原子的,并且有助于保持处理器之间的一致性。

Atomic bit operations are very fast, since they perform the operation using a single machine instruction without disabling interrupts whenever the underlying platform can do that. The functions are architecture dependent and are declared in <asm/bitops.h>. They are guaranteed to be atomic even on SMP computers and are useful to keep coherence across processors.

不幸的是,这些函数中的数据类型也依赖于体系结构。参数nr(描述要操作的位)通常定义为int,但在少数架构上是unsigned long。要修改的地址通常是指向unsigned long的指针,但少数体系结构使用void *。

Unfortunately, data typing in these functions is architecture dependent as well. The nr argument (describing which bit to manipulate) is usually defined as int but is unsigned long for a few architectures. The address to be modified is usually a pointer to unsigned long, but a few architectures use void * instead.

可用的位操作有:

The available bit operations are:

void set_bit(nr, void *addr);

设置addr指向的数据项中的第nr位。

Sets bit number nr in the data item pointed to by addr.

void clear_bit(nr, void *addr);

清除位于addr处的unsigned long数据中的指定位。其语义在其他方面与set_bit相同。

Clears the specified bit in the unsigned long datum that lives at addr. Its semantics are otherwise the same as set_bit.

void change_bit(nr, void *addr);

切换位。

Toggles the bit.

test_bit(nr, void *addr);

该函数是唯一不需要原子的位操作;它只是返回该位的当前值。

This function is the only bit operation that doesn't need to be atomic; it simply returns the current value of the bit.

int test_and_set_bit(nr, void *addr);

int test_and_clear_bit(nr, void *addr);

int test_and_change_bit(nr, void *addr);

其行为与前面列出的原子行为类似,只是它们还返回该位的先前值。

Behave atomically like those listed previously, except that they also return the previous value of the bit.

当这些函数用于访问和修改共享标志时,除了调用它们之外,您无需执行任何操作;他们以原子方式执行操作。另一方面,使用位操作来管理控制对共享变量的访问的锁变量则稍微复杂一些,值得举例。大多数现代代码不以这种方式使用位操作,但类似以下的代码仍然存在于内核中。

When these functions are used to access and modify a shared flag, you don't have to do anything except call them; they perform their operations in an atomic manner. Using bit operations to manage a lock variable that controls access to a shared variable, on the other hand, is a little more complicated and deserves an example. Most modern code does not use bit operations in this way, but code like the following still exists in the kernel.

需要访问共享数据项的代码段尝试使用test_and_set_bit或test_and_clear_bit以原子方式获取锁。通常的实现如下所示;它假设锁位于地址addr的第nr位。它还假设锁空闲时该位为0,锁忙时该位非零。

A code segment that needs to access a shared data item tries to atomically acquire a lock using either test_and_set_bit or test_and_clear_bit. The usual implementation is shown here; it assumes that the lock lives at bit nr of address addr. It also assumes that the bit is 0 when the lock is free or nonzero when the lock is busy.

/* try to set lock */
while (test_and_set_bit(nr, addr) != 0)
    wait_for_a_while(  );

/* do your work */

/* release lock, and check... */
if (test_and_clear_bit(nr, addr) == 0)
    something_went_wrong(  ); /* already released: error */

如果您通读内核源代码,您会发现与此示例类似的代码。然而,在新代码中使用自旋锁要好得多;自旋锁经过良好的调试,它们可以处理中断和内核抢占等问题,并且其他阅读您代码的人不必费力就能理解您在做什么。

If you read through the kernel source, you find code that works like this example. It is, however, far better to use spinlocks in new code; spinlocks are well debugged, they handle issues like interrupts and kernel preemption, and others reading your code do not have to work to understand what you are doing.

序列锁

seqlocks

2.6 内核包含一些旨在提供对共享资源的快速、无锁访问的新机制。Seqlock 适用于要保护的资源较小、简单且访问频繁,而写入访问很少但必须快速的情况。本质上,它们的工作原理是允许读取器自由访问资源,但要求读取器检查与写入器的冲突,并在发生此类冲突时重试访问。Seqlock 通常不能用于保护涉及指针的数据结构,因为当写入器更改数据结构时,读取器可能正在跟踪一个无效的指针。

The 2.6 kernel contains a couple of new mechanisms that are intended to provide fast, lockless access to a shared resource. Seqlocks work in situations where the resource to be protected is small, simple, and frequently accessed, and where write access is rare but must be fast. Essentially, they work by allowing readers free access to the resource but requiring those readers to check for collisions with writers and, when such a collision happens, retry their access. Seqlocks generally cannot be used to protect data structures involving pointers, because the reader may be following a pointer that is invalid while the writer is changing the data structure.

Seqlock 在 <linux/seqlock.h> 中定义。初始化 seqlock(类型为 seqlock_t)有两种常用方法:

Seqlocks are defined in <linux/seqlock.h>. There are the two usual methods for initializing a seqlock (which has type seqlock_t):

seqlock_t lock1 = SEQLOCK_UNLOCKED;

seqlock_t lock2;
seqlock_init(&lock2);

读访问通过在进入临界区时获取(无符号)整数序列值来进行。退出时,将该序列值与当前值进行比较;如果不匹配,则必须重试读取访问。因此,阅读器代码的形式如下:

Read access works by obtaining an (unsigned) integer sequence value on entry into the critical section. On exit, that sequence value is compared with the current value; if there is a mismatch, the read access must be retried. As a result, reader code has a form like the following:

unsigned int seq;

do {
    seq = read_seqbegin(&the_lock);
    /* 做你需要做的事 */
} while (read_seqretry(&the_lock, seq));
unsigned int seq;

do {
    seq = read_seqbegin(&the_lock);
    /* Do what you need to do */
} while (read_seqretry(&the_lock, seq));

这种锁通常用于保护某种需要多个一致值的简单计算。如果计算结束时的测试表明发生了并发写入,则可以简单地丢弃结果并重新计算。

This sort of lock is usually used to protect some sort of simple computation that requires multiple, consistent values. If the test at the end of the computation shows that a concurrent write occurred, the results can be simply discarded and recomputed.

如果您的 seqlock 可以从中断处理程序访问,您应该使用 IRQ 安全版本:

If your seqlock might be accessed from an interrupt handler, you should use the IRQ-safe versions instead:

unsigned int read_seqbegin_irqsave(seqlock_t *lock, 
                                   unsigned long flags);
int read_seqretry_irqrestore(seqlock_t *lock, unsigned int seq,
                             unsigned long flags);

写入者必须获得独占锁才能进入受 seqlock 保护的临界区。为此,请调用:

Writers must obtain an exclusive lock to enter the critical section protected by a seqlock. To do so, call:

void write_seqlock(seqlock_t *lock);

写锁是通过自旋锁实现的,因此所有常见的约束都适用。调用:

The write lock is implemented with a spinlock, so all the usual constraints apply. Make a call to:

void write_sequnlock(seqlock_t *lock);

释放锁。由于自旋锁用于控制写访问,因此所有常见的变体都可用:

to release the lock. Since spinlocks are used to control write access, all of the usual variants are available:

void write_seqlock_irqsave(seqlock_t *lock, unsigned long flags);
void write_seqlock_irq(seqlock_t *lock);
void write_seqlock_bh(seqlock_t *lock);

void write_sequnlock_irqrestore(seqlock_t *lock, unsigned long flags);
void write_sequnlock_irq(seqlock_t *lock);
void write_sequnlock_bh(seqlock_t *lock);

还有一个 write_tryseqlock,如果能够获得锁,它将返回非零值。

There is also a write_tryseqlock that returns nonzero if it was able to obtain the lock.

读取-复制-更新

Read-Copy-Update

读取-复制-更新 (RCU) 是一种先进的互斥方案,可以在适当的条件下产生高性能。它在驱动程序中的使用很少但并非未知,因此值得在这里快速概述。对 RCU 算法的完整细节感兴趣的人可以在其创建者发布的白皮书 ( http://www.rdrop.com/users/paulmck/rclock/intro/rclock_intro.html ) 中找到它们。

Read-copy-update (RCU) is an advanced mutual exclusion scheme that can yield high performance in the right conditions. Its use in drivers is rare but not unknown, so it is worth a quick overview here. Those who are interested in the full details of the RCU algorithm can find them in the white paper published by its creator (http://www.rdrop.com/users/paulmck/rclock/intro/rclock_intro.html).

RCU 对它可以保护的数据结构类型设置了许多限制。它针对读取常见而写入很少的情况进行了优化。受保护的资源应通过指针访问,并且对这些资源的所有引用必须仅由原子代码保存。当数据结构需要更改时,写入线程会制作一个副本,更改副本,然后将相关指针指向新版本——这就是算法的名称。当内核确定不再保留对旧版本的引用时,可以将其释放。

RCU places a number of constraints on the sort of data structure that it can protect. It is optimized for situations where reads are common and writes are rare. The resources being protected should be accessed via pointers, and all references to those resources must be held only by atomic code. When the data structure needs to be changed, the writing thread makes a copy, changes the copy, then aims the relevant pointer at the new version—thus, the name of the algorithm. When the kernel is sure that no references to the old version remain, it can be freed.

作为 RCU 实际使用的示例,请考虑网络路由表。每个传出数据包都需要检查路由表以确定应使用哪个接口。检查速度很快,并且一旦内核找到目标接口,它就不再需要路由表条目。RCU 允许在不锁定的情况下执行路由查找,从而具有显着的性能优势。内核中的 Starmode 无线电 IP 驱动程序还使用 RCU 来跟踪其设备列表。

As an example of real-world use of RCU, consider the network routing tables. Every outgoing packet requires a check of the routing tables to determine which interface should be used. The check is fast, and, once the kernel has found the target interface, it no longer needs the routing table entry. RCU allows route lookups to be performed without locking, with significant performance benefits. The Starmode radio IP driver in the kernel also uses RCU to keep track of its list of devices.

使用 RCU 的代码应包含<linux/rcupdate.h>

Code using RCU should include <linux/rcupdate.h>.

在读取方面,使用 RCU 保护的数据结构的代码应将其引用与对rcu_read_lockrcu_read_unlock的调用括起来。因此,RCU 代码往往如下所示:

On the read side, code using an RCU-protected data structure should bracket its references with calls to rcu_read_lock and rcu_read_unlock. As a result, RCU code tends to look like:

struct my_stuff *stuff;

rcu_read_lock(  );
stuff = find_the_stuff(args...);
do_something_with(stuff);
rcu_read_unlock(  );

rcu_read_lock调用速度很快;它禁用内核抢占但不等待任何事情。持有读“锁”时执行的代码必须是原子的。调用 rcu_read_unlock后,不得使用对受保护资源的引用。

The rcu_read_lock call is fast; it disables kernel preemption but does not wait for anything. The code that executes while the read "lock" is held must be atomic. No reference to the protected resource may be used after the call to rcu_read_unlock.

需要更改受保护结构的代码必须执行几个步骤。第一部分很简单;它分配一个新的结构,如果需要,从旧的结构复制数据,然后替换读取代码看到的指针。至此,对于读端而言,更改完成;任何进入临界区的代码都会看到新版本的数据。

Code that needs to change the protected structure has to carry out a few steps. The first part is easy; it allocates a new structure, copies data from the old one if need be, then replaces the pointer that is seen by the read code. At this point, for the purposes of the read side, the change is complete; any code entering the critical section sees the new version of the data.

剩下的就是释放旧版本。当然,问题是在其他处理器上运行的代码可能仍然引用旧数据,因此无法立即释放它。相反,写入代码必须等到它知道不存在此类引用为止。由于保存对此数据结构的引用的所有代码都必须(按照规则)是原子的,因此我们知道,一旦系统上的每个处理器至少被调度一次,所有引用都必须消失。这就是 RCU 所做的;它预留一个回调,等待所有处理器都已调度;然后运行该回调来执行清理工作。

All that remains is to free the old version. The problem, of course, is that code running on other processors may still have a reference to the older data, so it cannot be freed immediately. Instead, the write code must wait until it knows that no such reference can exist. Since all code holding references to this data structure must (by the rules) be atomic, we know that once every processor on the system has been scheduled at least once, all references must be gone. So that is what RCU does; it sets aside a callback that waits until all processors have scheduled; that callback is then run to perform the cleanup work.

更改受 RCU 保护的数据结构的代码必须通过分配一个 struct rcu_head 来获取其清理回调,尽管它不需要以任何方式初始化该结构。通常,该结构只是嵌入在受 RCU 保护的较大资源中。对该资源的更改完成后,应调用:

Code that changes an RCU-protected data structure must get its cleanup callback by allocating a struct rcu_head, although it doesn't need to initialize that structure in any way. Often, that structure is simply embedded within the larger resource that is protected by RCU. After the change to that resource is complete, a call should be made to:

void call_rcu(struct rcu_head *head, void (*func)(void *arg), void *arg);

当可以安全释放资源时,将调用给定的 func;传递给它的 arg 与传递给 call_rcu 的相同。通常,func 唯一需要做的就是调用 kfree。

The given func is called when it is safe to free the resource; it is passed the same arg that was passed to call_rcu. Usually, the only thing func needs to do is to call kfree.

完整的 RCU 接口比我们在这里看到的更复杂;例如,它包括用于处理受保护链表的实用函数。有关完整内容,请参阅相关头文件。

The full RCU interface is more complex than we have seen here; it includes, for example, utility functions for working with protected linked lists. See the relevant header files for the full story.

快速参考

Quick Reference

本章介绍了一组用于并发管理的重要符号。其中最重要的总结如下:

This chapter has introduced a substantial set of symbols for the management of concurrency. The most important of these are summarized here:

#include <asm/semaphore.h>

定义信号量及其操作的包含文件。

The include file that defines semaphores and the operations on them.

DECLARE_MUTEX(name);

DECLARE_MUTEX_LOCKED(name);

用于声明和初始化互斥模式下使用的信号量的两个宏。

Two macros for declaring and initializing a semaphore used in mutual exclusion mode.

void init_MUTEX(struct semaphore *sem);

void init_MUTEX_LOCKED(struct semaphore *sem);

这两个函数可用于在运行时初始化信号量。

These two functions can be used to initialize a semaphore at runtime.

void down(struct semaphore *sem);

int down_interruptible(struct semaphore *sem);

int down_trylock(struct semaphore *sem);

void up(struct semaphore *sem);

锁定和解锁信号量。必要时,down 使调用进程进入不可中断的睡眠状态;而 down_interruptible 可以被信号中断。down_trylock 不休眠;如果信号量不可用,它会立即返回。锁定信号量的代码最终必须使用 up 解锁它。

Lock and unlock a semaphore. down puts the calling process into an uninterruptible sleep if need be; down_interruptible, instead, can be interrupted by a signal. down_trylock does not sleep; instead, it returns immediately if the semaphore is unavailable. Code that locks a semaphore must eventually unlock it with up.

struct rw_semaphore;

init_rwsem(struct rw_semaphore *sem);

信号量的读取器/写入器版本以及初始化它的函数。

The reader/writer version of semaphores and the function that initializes it.

void down_read(struct rw_semaphore *sem);

int down_read_trylock(struct rw_semaphore *sem);

void up_read(struct rw_semaphore *sem);

用于获取和释放对读取器/写入器信号量的读取访问权限的函数。

Functions for obtaining and releasing read access to a reader/writer semaphore.

void down_write(struct rw_semaphore *sem)

int down_write_trylock(struct rw_semaphore *sem)

void up_write(struct rw_semaphore *sem)

void downgrade_write(struct rw_semaphore *sem)

用于管理对读取器/写入器信号量的写入访问的函数。

Functions for managing write access to a reader/writer semaphore.

#include <linux/completion.h>

DECLARE_COMPLETION(name);

init_completion(struct completion *c);

INIT_COMPLETION(struct completion c);

描述 Linux 完成机制以及初始化完成的正常方法的包含文件。INIT_COMPLETION 应该仅用于重新初始化以前使用过的完成。

The include file describing the Linux completion mechanism, and the normal methods for initializing completions. INIT_COMPLETION should be used only to reinitialize a completion that has been previously used.

void wait_for_completion(struct completion *c);

等待发出完成事件信号。

Wait for a completion event to be signalled.

void complete(struct completion *c);

void complete_all(struct completion *c);

发出完成事件信号。complete 最多唤醒一个等待线程,而 complete_all 则唤醒所有等待者。

Signal a completion event. complete wakes, at most, one waiting thread, while complete_all wakes all waiters.

void complete_and_exit(struct completion *c, long retval);

通过调用 complete 发出完成事件信号,并为当前线程调用 exit。

Signals a completion event by calling complete and calls exit for the current thread.

#include <linux/spinlock.h>

spinlock_t lock = SPIN_LOCK_UNLOCKED;

spin_lock_init(spinlock_t *lock);

定义自旋锁接口和初始化锁的两种方式的包含文件。

The include file defining the spinlock interface and the two ways of initializing locks.

void spin_lock(spinlock_t *lock);

void spin_lock_irqsave(spinlock_t *lock, unsigned long flags);

void spin_lock_irq(spinlock_t *lock);

void spin_lock_bh(spinlock_t *lock);

锁定自旋锁并可能禁用中断的各种方法。

The various ways of locking a spinlock and, possibly, disabling interrupts.

int spin_trylock(spinlock_t *lock);

int spin_trylock_bh(spinlock_t *lock);

上述函数的非自旋版本;如果无法获取锁,则返回 0,否则返回非零值。

Nonspinning versions of the above functions; these return 0 in case of failure to obtain the lock, nonzero otherwise.

void spin_unlock(spinlock_t *lock);

void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags);

void spin_unlock_irq(spinlock_t *lock);

void spin_unlock_bh(spinlock_t *lock);

释放自旋锁的相应方式。

The corresponding ways of releasing a spinlock.

rwlock_t lock = RW_LOCK_UNLOCKED

rwlock_init(rwlock_t *lock);

初始化读/写锁的两种方法。

The two ways of initializing reader/writer locks.

void read_lock(rwlock_t *lock);

void read_lock_irqsave(rwlock_t *lock, unsigned long flags);

void read_lock_irq(rwlock_t *lock);

void read_lock_bh(rwlock_t *lock);

用于获取对读取器/写入器锁的读取访问权限的函数。

Functions for obtaining read access to a reader/writer lock.

void read_unlock(rwlock_t *lock);

void read_unlock_irqrestore(rwlock_t *lock, unsigned long flags);

void read_unlock_irq(rwlock_t *lock);

void read_unlock_bh(rwlock_t *lock);

用于释放对读取器/写入器自旋锁的读取访问权限的函数。

Functions for releasing read access to a reader/writer spinlock.

void write_lock(rwlock_t *lock);

void write_lock_irqsave(rwlock_t *lock, unsigned long flags);

void write_lock_irq(rwlock_t *lock);

void write_lock_bh(rwlock_t *lock);

用于获取对读/写锁的写访问权限的函数。

Functions for obtaining write access to a reader/writer lock.

void write_unlock(rwlock_t *lock);

void write_unlock_irqrestore(rwlock_t *lock, unsigned long flags);

void write_unlock_irq(rwlock_t *lock);

void write_unlock_bh(rwlock_t *lock);

用于释放对读取器/写入器自旋锁的写入访问的函数。

Functions for releasing write access to a reader/writer spinlock.

#include <asm/atomic.h>

atomic_t v = ATOMIC_INIT(value);

void atomic_set(atomic_t *v, int i);

int atomic_read(atomic_t *v);

void atomic_add(int i, atomic_t *v);

void atomic_sub(int i, atomic_t *v);

void atomic_inc(atomic_t *v);

void atomic_dec(atomic_t *v);

int atomic_inc_and_test(atomic_t *v);

int atomic_dec_and_test(atomic_t *v);

int atomic_sub_and_test(int i, atomic_t *v);

int atomic_add_negative(int i, atomic_t *v);

int atomic_add_return(int i, atomic_t *v);

int atomic_sub_return(int i, atomic_t *v);

int atomic_inc_return(atomic_t *v);

int atomic_dec_return(atomic_t *v);

以原子方式访问整数变量。atomic_t 变量只能通过这些函数访问。

Atomically access integer variables. The atomic_t variables must be accessed only through these functions.

#include <asm/bitops.h>

void set_bit(nr, void *addr);

void clear_bit(nr, void *addr);

void change_bit(nr, void *addr);

test_bit(nr, void *addr);

int test_and_set_bit(nr, void *addr);

int test_and_clear_bit(nr, void *addr);

int test_and_change_bit(nr, void *addr);

原子访问位值;它们可用于标志或锁定变量。使用这些函数可以防止与并发访问该位相关的任何竞争条件。

Atomically access bit values; they can be used for flags or lock variables. Using these functions prevents any race condition related to concurrent access to the bit.

#include <linux/seqlock.h>

seqlock_t lock = SEQLOCK_UNLOCKED;

seqlock_init(seqlock_t *lock);

定义 seqlock 的包含文件以及初始化它们的两种方法。

The include file defining seqlocks and the two ways of initializing them.

unsigned int read_seqbegin(seqlock_t *lock);

unsigned int read_seqbegin_irqsave(seqlock_t *lock, unsigned long flags);

int read_seqretry(seqlock_t *lock, unsigned int seq);

int read_seqretry_irqrestore(seqlock_t *lock, unsigned int seq, unsigned long flags);

用于获取对受 seqlock 保护的资源的读取访问权限的函数。

Functions for obtaining read access to a seqlock-protected resource.

void write_seqlock(seqlock_t *lock);

void write_seqlock_irqsave(seqlock_t *lock, unsigned long flags);

void write_seqlock_irq(seqlock_t *lock);

void write_seqlock_bh(seqlock_t *lock);

int write_tryseqlock(seqlock_t *lock);

用于获取对 seqlock 保护资源的写访问权限的函数。

Functions for obtaining write access to a seqlock-protected resource.

void write_sequnlock(seqlock_t *lock);

void write_sequnlock_irqrestore(seqlock_t *lock, unsigned long flags);

void write_sequnlock_irq(seqlock_t *lock);

void write_sequnlock_bh(seqlock_t *lock);

用于释放对 seqlock 保护资源的写访问权限的函数。

Functions for releasing write access to a seqlock-protected resource.

#include <linux/rcupdate.h>

使用读取-复制-更新 (RCU) 机制所需的包含文件。

The include file required to use the read-copy-update (RCU) mechanism.

void rcu_read_lock;

void rcu_read_unlock;

用于获得对受 RCU 保护的资源的原子读取访问权限的宏。

Macros for obtaining atomic read access to a resource protected by RCU.

void call_rcu(struct rcu_head *head, void (*func)(void *arg), void *arg);

在调度所有处理器并且可以安全释放受 RCU 保护的资源后安排回调运行。

Arranges for a callback to run after all processors have been scheduled and an RCU-protected resource can be safely freed.




[ 1 ]就本章而言,执行“线程”是指运行代码的任何上下文。每个进程显然都是一个执行线程,但中断处理程序或响应异步内核事件而运行的其他代码也是如此。

[1] For the purposes of this chapter, a "thread" of execution is any context that is running code. Each process is clearly a thread of execution, but so is an interrupt handler or other code running in response to an asynchronous kernel event.

[ 2 ]截至撰写本文时,添加可中断版本的补丁正在流通,但尚未合并到主线中。

[2] As of this writing, patches adding interruptible versions were in circulation but had not been merged into the mainline.

[ 3 ]这个锁在 2.6 中仍然存在,尽管它现在只覆盖了很少的内核。如果您偶然发现了lock_kernel调用,那么您已经找到了大内核锁。但是,甚至不要考虑在任何新代码中使用它。

[3] This lock still exists in 2.6, though it covers very little of the kernel now. If you stumble across a lock_kernel call, you have found the big kernel lock. Do not even think about using it in any new code, however.

第 6 章 高级 Char 驱动程序操作

Chapter 6. Advanced Char Driver Operations

第 3 章中,我们构建了一个完整的设备驱动程序,用户可以对其进行写入和读取。但实际设备通常提供比同步读写更多功能。既然我们已经配备了出现问题时的调试工具,并且对并发问题有深入的了解,以帮助防止出现问题,那么我们就可以安全地继续创建更高级的驱动程序。

In Chapter 3, we built a complete device driver that the user can write to and read from. But a real device usually offers more functionality than synchronous read and write. Now that we're equipped with debugging tools should something go awry—and a firm understanding of concurrency issues to help keep things from going awry—we can safely go ahead and create a more advanced driver.

本章研究了编写功能齐全的字符设备驱动程序时需要了解的一些概念。我们从实现ioctl系统调用开始,它是用于设备控制的通用接口。然后我们继续进行与用户空间同步的各种方式;到本章结束时,您已经很好地了解了如何使进程进入睡眠状态(并唤醒它们)、实现非阻塞 I/O 以及在设备可用于读取或写入时通知用户空间。最后我们将了解如何在驱动程序中实现一些不同的设备访问策略。

This chapter examines a few concepts that you need to understand to write fully featured char device drivers. We start with implementing the ioctl system call, which is a common interface used for device control. Then we proceed to various ways of synchronizing with user space; by the end of this chapter you have a good idea of how to put processes to sleep (and wake them up), implement nonblocking I/O, and inform user space when your devices are available for reading or writing. We finish with a look at how to implement a few different device access policies within drivers.

这里讨论的想法是通过scull驱动程序的几个修改版本来演示的 。再次强调,一切都是使用内存虚拟设备实现的,因此您可以自己尝试代码,而不需要任何特定的硬件。现在,您可能想亲自动手使用真正的硬件,但这必须等到第 9 章

The ideas discussed here are demonstrated by way of a couple of modified versions of the scull driver. Once again, everything is implemented using in-memory virtual devices, so you can try out the code yourself without needing to have any particular hardware. By now, you may be wanting to get your hands dirty with real hardware, but that will have to wait until Chapter 9.

读写控制

ioctl

大多数驱动程序除了读写设备的能力之外,还需要通过设备驱动程序执行各种类型硬件控制的能力。大多数设备可以执行简单数据传输之外的操作;例如,用户空间通常必须能够请求设备锁门、弹出介质、报告错误信息、更改波特率或自毁。这些操作通常通过 ioctl 方法来支持,该方法实现了同名的系统调用。

Most drivers need—in addition to the ability to read and write the device—the ability to perform various types of hardware control via the device driver. Most devices can perform operations beyond simple data transfers; user space must often be able to request, for example, that the device lock its door, eject its media, report error information, change a baud rate, or self destruct. These operations are usually supported via the ioctl method, which implements the system call by the same name.

在用户空间中,ioctl系统调用具有以下原型:

In user space, the ioctl system call has the following prototype:

int ioctl(int fd, unsigned long cmd, ...);
int ioctl(int fd, unsigned long cmd, ...);

该原型由于点的存在而在 Unix 系统调用列表中脱颖而出,这些点通常将函数标记为具有可变数量的参数。然而,在真实的系统中,系统调用实际上不能具有可变数量的参数。系统调用必须有一个明确定义的原型,因为用户程序只能通过硬件“门”访问它们。因此,原型中的点代表的不是可变数量的参数,而是单个可选参数,传统上标识为char *argp。这些点只是为了防止编译期间进行类型检查。第三个参数的实际性质取决于发出的特定控制命令(第二个参数)。有些命令不带参数,有些命令带整数值,有些命令带指向其他数据的指针。使用指针是将任意数据传递给 ioctl调用的方法;然后设备就能够与用户空间交换任意数量的数据。

The prototype stands out in the list of Unix system calls because of the dots, which usually mark the function as having a variable number of arguments. In a real system, however, a system call can't actually have a variable number of arguments. System calls must have a well-defined prototype, because user programs can access them only through hardware "gates." Therefore, the dots in the prototype represent not a variable number of arguments but a single optional argument, traditionally identified as char *argp. The dots are simply there to prevent type checking during compilation. The actual nature of the third argument depends on the specific control command being issued (the second argument). Some commands take no arguments, some take an integer value, and some take a pointer to other data. Using a pointer is the way to pass arbitrary data to the ioctl call; the device is then able to exchange any amount of data with user space.

ioctl调用的非结构化性质导致它失去了内核开发人员的青睐。每个ioctl命令本质上都是一个单独的、通常未记录的系统调用,并且无法以任何全面的方式审核这些调用。非结构化的ioctl也很难制作参数在所有系统上的工作方式相同;例如,考虑 64 位系统,其中用户空间进程以 32 位模式运行。因此,存在着通过任何其他方式实施杂项控制操作的巨大压力。可能的替代方案包括将命令嵌入到数据流中(我们将在本章后面讨论这种方法)或使用虚拟文件系统(sysfs 或特定于驱动程序的文件系统)。(我们将在第 14 章中讨论 sysfs 。)然而,事实仍然是ioctl通常是真正设备操作的最简单、最直接的选择。

The unstructured nature of the ioctl call has caused it to fall out of favor among kernel developers. Each ioctl command is, essentially, a separate, usually undocumented system call, and there is no way to audit these calls in any sort of comprehensive manner. It is also difficult to make the unstructured ioctl arguments work identically on all systems; for example, consider 64-bit systems with a user-space process running in 32-bit mode. As a result, there is strong pressure to implement miscellaneous control operations by just about any other means. Possible alternatives include embedding commands into the data stream (we will discuss this approach later in this chapter) or using virtual filesystems, either sysfs or driver-specific filesystems. (We will look at sysfs in Chapter 14.) However, the fact remains that ioctl is often the easiest and most straightforward choice for true device operations.

ioctl驱动程序方法原型与用户空间版本有所不同:

The ioctl driver method has a prototype that differs somewhat from the user-space version:

int (*ioctl) (struct inode *inode, struct file *filp,
              unsigned int cmd, unsigned long arg);

inode 和 filp 指针是与应用程序传递的文件描述符 fd 相对应的值,与传递给 open 方法的参数相同。cmd 参数从用户处原样传递,可选的 arg 参数以 unsigned long 的形式传递,无论用户给出的是整数还是指针。如果调用程序未传递第三个参数,则驱动程序操作接收到的 arg 值是未定义的。由于对额外参数禁用了类型检查,因此如果将无效参数传递给 ioctl,编译器无法发出警告,并且任何相关的错误都将很难发现。

The inode and filp pointers are the values corresponding to the file descriptor fd passed on by the application and are the same parameters passed to the open method. The cmd argument is passed from the user unchanged, and the optional arg argument is passed in the form of an unsigned long, regardless of whether it was given by the user as an integer or a pointer. If the invoking program doesn't pass a third argument, the arg value received by the driver operation is undefined. Because type checking is disabled on the extra argument, the compiler can't warn you if an invalid argument is passed to ioctl, and any associated bug would be difficult to spot.

正如您可能想象的那样,大多数 ioctl实现都包含一个大switch语句,该语句根据cmd参数选择正确的行为。不同的命令有不同的数值,通常给出符号名称以简化编码。符号名称由预处理器定义分配。自定义驱动程序通常在其头文件中声明此类符号;scull.h为scull声明它们 。当然,用户程序还必须包含该头文件才能访问这些符号。

As you might imagine, most ioctl implementations consist of a big switch statement that selects the correct behavior according to the cmd argument. Different commands have different numeric values, which are usually given symbolic names to simplify coding. The symbolic name is assigned by a preprocessor definition. Custom drivers usually declare such symbols in their header files; scull.h declares them for scull. User programs must, of course, include that header file as well to have access to those symbols.

选择 ioctl 命令

Choosing the ioctl Commands

在编写 ioctl 代码之前,您需要选择与命令相对应的数字。许多程序员的第一直觉是选择一组以 0 或 1 开头并从那里向上递增的小数字。然而,我们有充分的理由不这样做。ioctl 命令编号在整个系统中应该是唯一的,以防止因向错误的设备发出正确的命令而导致错误。这种不匹配并非不可能发生:程序可能会发现自己试图更改非串行端口输入流(例如 FIFO 或音频设备)的波特率。如果每个 ioctl 编号都是唯一的,则应用程序会收到 EINVAL 错误,而不是成功执行意外操作。

Before writing the code for ioctl, you need to choose the numbers that correspond to commands. The first instinct of many programmers is to choose a set of small numbers starting with 0 or 1 and going up from there. There are, however, good reasons for not doing things that way. The ioctl command numbers should be unique across the system in order to prevent errors caused by issuing the right command to the wrong device. Such a mismatch is not unlikely to happen, and a program might find itself trying to change the baud rate of a non-serial-port input stream, such as a FIFO or an audio device. If each ioctl number is unique, the application gets an EINVAL error rather than succeeding in doing something unintended.

为了帮助程序员创建独特的 ioctl命令代码,这些代码已分为多个位字段。Linux 的第一个版本使用 16 位数字:前八位是与设备关联的“魔法”数字,后八位是连续数字,在设备内是唯一的。发生这种情况是因为莱纳斯“一无所知”(他自己说);后来才构思出更好的位域划分。不幸的是,相当多的驱动程序仍然使用旧的约定。他们必须这样做:更改命令代码会破坏二进制程序的无穷无尽,而这不是内核开发人员愿意做的事情。

To help programmers create unique ioctl command codes, these codes have been split up into several bitfields. The first versions of Linux used 16-bit numbers: the top eight were the "magic" numbers associated with the device, and the bottom eight were a sequential number, unique within the device. This happened because Linus was "clueless" (his own word); a better division of bitfields was conceived only later. Unfortunately, quite a few drivers still use the old convention. They have to: changing the command codes would break no end of binary programs, and that is not something the kernel developers are willing to do.

要根据 Linux 内核约定为您的驱动程序选择 ioctl编号,您应该首先检查include/asm/ioctl.hDocumentation/ioctl-number.txt。标头定义您将使用的位字段:类型(幻数)、序数、传输方向和参数大小。ioctl -number.txt文件列出了整个内核中使用的幻数,[ 1 ],因此您将能够选择自己的幻数并避免重叠。该文本文件还列出了应使用该约定的原因。

To choose ioctl numbers for your driver according to the Linux kernel convention, you should first check include/asm/ioctl.h and Documentation/ioctl-number.txt. The header defines the bitfields you will be using: type (magic number), ordinal number, direction of transfer, and size of argument. The ioctl-number.txt file lists the magic numbers used throughout the kernel,[1] so you'll be able to choose your own magic number and avoid overlaps. The text file also lists the reasons why the convention should be used.

定义ioctl命令编号的批准方法使用四个位字段,它们具有以下含义。此列表中引入的新符号在<linux/ioctl.h>中定义。

The approved way to define ioctl command numbers uses four bitfields, which have the following meanings. New symbols introduced in this list are defined in <linux/ioctl.h>.

type
type

幻数。只需(在查阅 ioctl-number.txt 之后)选择一个数字,并在整个驱动程序中使用它。该字段为八位宽(_IOC_TYPEBITS)。

The magic number. Just choose one number (after consulting ioctl-number.txt) and use it throughout the driver. This field is eight bits wide (_IOC_TYPEBITS).

number
number

序号(顺序编号)。它为八位(_IOC_NRBITS)宽。

The ordinal (sequential) number. It's eight bits (_IOC_NRBITS) wide.

direction
direction

数据传输的方向(如果该命令涉及数据传输)。可能的值有 _IOC_NONE(无数据传输)、_IOC_READ、_IOC_WRITE 和 _IOC_READ|_IOC_WRITE(双向传输数据)。数据传输是从应用程序的角度来看的;_IOC_READ 意味着从设备读取,因此驱动程序必须向用户空间写入。请注意,该字段是一个位掩码,因此可以使用逻辑 AND 运算提取 _IOC_READ 和 _IOC_WRITE。

The direction of data transfer, if the particular command involves a data transfer. The possible values are _IOC_NONE (no data transfer), _IOC_READ, _IOC_WRITE, and _IOC_READ|_IOC_WRITE (data is transferred both ways). Data transfer is seen from the application's point of view; _IOC_READ means reading from the device, so the driver must write to user space. Note that the field is a bit mask, so _IOC_READ and _IOC_WRITE can be extracted using a logical AND operation.

size
size

所涉及的用户数据的大小。该字段的宽度取决于体系结构,但通常为 13 或 14 位。您可以在宏 _IOC_SIZEBITS 中找到您的特定体系结构上的值。使用 size 字段并不是强制性的——内核不会检查它——但这是一个好主意。正确使用此字段有助于检测用户空间编程错误,并且如果您将来需要更改相关数据项的大小,它还能让您实现向后兼容。但是,如果您需要更大的数据结构,可以直接忽略 size 字段。我们很快就会看到该字段是如何使用的。

The size of user data involved. The width of this field is architecture dependent, but is usually 13 or 14 bits. You can find its value for your specific architecture in the macro _IOC_SIZEBITS. It's not mandatory that you use the size field—the kernel does not check it—but it is a good idea. Proper use of this field can help detect user-space programming errors and enable you to implement backward compatibility if you ever need to change the size of the relevant data item. If you need larger data structures, however, you can just ignore the size field. We'll see how this field is used soon.

头文件 <asm/ioctl.h>(由 <linux/ioctl.h> 包含)定义了帮助设置命令编号的宏,如下所示:_IO(type,nr)(用于没有参数的命令)、_IOR(type,nr,datatype)(用于从驱动程序读取数据)、_IOW(type,nr,datatype)(用于写入数据)和 _IOWR(type,nr,datatype)(用于双向传输)。type 和 number 字段作为参数传递,而 size 字段则通过对 datatype 参数应用 sizeof 来得到。

The header file <asm/ioctl.h>, which is included by <linux/ioctl.h>, defines macros that help set up the command numbers as follows: _IO(type,nr) (for a command that has no argument), _IOR(type,nr,datatype) (for reading data from the driver), _IOW(type,nr,datatype) (for writing data), and _IOWR(type,nr,datatype) (for bidirectional transfers). The type and number fields are passed as arguments, and the size field is derived by applying sizeof to the datatype argument.

标头还定义了可在驱动程序中用于解码数字的宏: _IOC_DIR(nr)_IOC_TYPE(nr)_IOC_NR(nr)_IOC_SIZE(nr)。我们不会详细介绍这些宏,因为头文件很清楚,并且示例代码将在本节后面显示。

The header also defines macros that may be used in your driver to decode the numbers: _IOC_DIR(nr), _IOC_TYPE(nr), _IOC_NR(nr), and _IOC_SIZE(nr). We won't go into any more detail about these macros because the header file is clear, and sample code is shown later in this section.

以下是scull中一些ioctl命令的定义方式。特别是,这些命令设置和获取驱动程序的可配置参数。

Here is how some ioctl commands are defined in scull. In particular, these commands set and get the driver's configurable parameters.

/* Use 'k' as magic number */
#define SCULL_IOC_MAGIC  'k'
/* Please use a different 8-bit number in your code */

#define SCULL_IOCRESET    _IO(SCULL_IOC_MAGIC, 0)

/*
 * S means "Set" through a ptr,
 * T means "Tell" directly with the argument value
 * G means "Get": reply by setting through a pointer
 * Q means "Query": response is on the return value
 * X means "eXchange": switch G and S atomically
 * H means "sHift": switch T and Q atomically
 */
#define SCULL_IOCSQUANTUM _IOW(SCULL_IOC_MAGIC,  1, int)
#define SCULL_IOCSQSET    _IOW(SCULL_IOC_MAGIC,  2, int)
#define SCULL_IOCTQUANTUM _IO(SCULL_IOC_MAGIC,   3)
#define SCULL_IOCTQSET    _IO(SCULL_IOC_MAGIC,   4)
#define SCULL_IOCGQUANTUM _IOR(SCULL_IOC_MAGIC,  5, int)
#define SCULL_IOCGQSET    _IOR(SCULL_IOC_MAGIC,  6, int)
#define SCULL_IOCQQUANTUM _IO(SCULL_IOC_MAGIC,   7)
#define SCULL_IOCQQSET    _IO(SCULL_IOC_MAGIC,   8)
#define SCULL_IOCXQUANTUM _IOWR(SCULL_IOC_MAGIC, 9, int)
#define SCULL_IOCXQSET    _IOWR(SCULL_IOC_MAGIC,10, int)
#define SCULL_IOCHQUANTUM _IO(SCULL_IOC_MAGIC,  11)
#define SCULL_IOCHQSET    _IO(SCULL_IOC_MAGIC,  12)

#define SCULL_IOC_MAXNR 14

实际的源文件定义了一些此处未显示的额外命令。

The actual source file defines a few extra commands that have not been shown here.

我们选择同时实现传递整数参数的两种方式:通过指针和通过显式值(尽管按照既定约定,ioctl 应该通过指针交换值)。类似地,返回整数也采用了两种方式:通过指针或通过设置返回值。只要返回值是正整数,这种方法就有效;正如您现在所知,从任何系统调用返回时,正值会被保留(正如我们在 read 和 write 中看到的那样),而负值会被视为错误,并用于在用户空间中设置 errno。[2]

We chose to implement both ways of passing integer arguments: by pointer and by explicit value (although, by an established convention, ioctl should exchange values by pointer). Similarly, both ways are used to return an integer number: by pointer or by setting the return value. This works as long as the return value is a positive integer; as you know by now, on return from any system call, a positive value is preserved (as we saw for read and write), while a negative value is considered an error and is used to set errno in user space.[2]

“交换(exchange)”和“移位(shift)”操作对 scull 来说并不是特别有用。我们实现“交换”是为了展示驱动程序如何将多个单独的操作组合成一个原子操作,实现“移位”则是为了将“告知”和“查询”配对。有时确实需要这样的原子测试并设置操作,特别是当应用程序需要设置或释放锁时。

The "exchange" and "shift" operations are not particularly useful for scull. We implemented "exchange" to show how the driver can combine separate operations into a single atomic one, and "shift" to pair "tell" and "query." There are times when atomic test-and-set operations like these are needed, in particular, when applications need to set or release locks.

命令的显式序号没有特定含义。它仅用于区分命令。实际上,您甚至可以对读取命令和写入命令使用相同的序号,因为实际的ioctl编号在“方向”位中是不同的,但您没有理由这样做。我们选择不在声明中以外的任何地方使用命令的序数,因此我们没有为其分配符号值。这就是为什么明确的数字出现在前面给出的定义中的原因。该示例显示了使用命令编号的一种方法,但您可以自由地采用不同的方式。

The explicit ordinal number of the command has no specific meaning. It is used only to tell the commands apart. Actually, you could even use the same ordinal number for a read command and a write command, since the actual ioctl number is different in the "direction" bits, but there is no reason why you would want to do so. We chose not to use the ordinal number of the command anywhere but in the declaration, so we didn't assign a symbolic value to it. That's why explicit numbers appear in the definition given previously. The example shows one way to use the command numbers, but you are free to do it differently.

除了少数预定义命令(稍后讨论)之外,ioctl cmd参数的值当前并未被内核使用,并且将来也不太可能使用。因此,如果您感到懒惰,可以避免前面显示的复杂声明并显式声明一组标量。另一方面,如果您这样做,您将不会从使用位域中受益,并且如果您提交代码以包含在主线内核中,您将会遇到困难。头文件<linux/kd.h>是这种老式方法的一个示例,使用 16 位标量值来定义 ioctl命令。该源文件依赖于标量,因为它使用了当时遵守的约定,而不是出于懒惰。现在更改它会导致无端的不兼容。

With the exception of a small number of predefined commands (to be discussed shortly), the value of the ioctl cmd argument is not currently used by the kernel, and it's quite unlikely it will be in the future. Therefore, you could, if you were feeling lazy, avoid the complex declarations shown earlier and explicitly declare a set of scalar numbers. On the other hand, if you did, you wouldn't benefit from using the bitfields, and you would encounter difficulties if you ever submitted your code for inclusion in the mainline kernel. The header <linux/kd.h> is an example of this old-fashioned approach, using 16-bit scalar values to define the ioctl commands. That source file relied on scalar numbers because it used the conventions obeyed at that time, not out of laziness. Changing it now would cause gratuitous incompatibility.

返回值

The Return Value

ioctl 的实现通常是一个基于命令编号的 switch 语句。但是,当命令编号不匹配任何有效操作时,default 分支应该怎么做?这个问题是有争议的。一些内核函数返回 -EINVAL(“无效参数”),这是有道理的,因为命令参数确实无效。然而,POSIX 标准规定,如果发出了不适当的 ioctl 命令,则应返回 -ENOTTY。C 库将此错误码解释为“设备的 ioctl 不合适”,这通常正是程序员需要听到的。不过,返回 -EINVAL 来响应无效的 ioctl 命令仍然相当常见。

The implementation of ioctl is usually a switch statement based on the command number. But what should the default selection be when the command number doesn't match a valid operation? The question is controversial. Several kernel functions return -EINVAL ("Invalid argument"), which makes sense because the command argument is indeed not a valid one. The POSIX standard, however, states that if an inappropriate ioctl command has been issued, then -ENOTTY should be returned. This error code is interpreted by the C library as "inappropriate ioctl for device," which is usually exactly what the programmer needs to hear. It's still pretty common, though, to return -EINVAL in response to an invalid ioctl command.

预定义命令

The Predefined Commands

虽然 ioctl 系统调用最常用于操作设备,但有一些命令是由内核识别的。请注意,这些命令在应用于您的设备时,会在调用您自己的文件操作之前被解码。因此,如果您为自己的某个 ioctl 命令选择了相同的编号,您将永远看不到对该命令的任何请求,并且由于 ioctl 编号之间的冲突,应用程序会得到意外的结果。

Although the ioctl system call is most often used to act on devices, a few commands are recognized by the kernel. Note that these commands, when applied to your device, are decoded before your own file operations are called. Thus, if you choose the same number for one of your ioctl commands, you won't ever see any request for that command, and the application gets something unexpected because of the conflict between the ioctl numbers.

预定义的 命令分为三组:

The predefined commands are divided into three groups:

  • 可以在任何文件(常规文件、设备、FIFO 或套接字)上发出的命令

  • Those that can be issued on any file (regular, device, FIFO, or socket)

  • 仅在常规文件上发出的命令

  • Those that are issued only on regular files

  • 特定于文件系统类型的命令

  • Those specific to the filesystem type

最后一组中的命令由承载文件系统的实现来执行(chattr 命令就是这样工作的)。设备驱动程序编写者只对第一组命令感兴趣,其幻数是“T”。研究其他各组的工作方式留给读者作为练习;ext2_ioctl 是一个非常有趣的函数(而且比人们预期的更容易理解),因为它实现了仅追加(append-only)标志和不可变(immutable)标志。

Commands in the last group are executed by the implementation of the hosting filesystem (this is how the chattr command works). Device driver writers are interested only in the first group of commands, whose magic number is "T." Looking at the workings of the other groups is left to the reader as an exercise; ext2_ioctl is a most interesting function (and easier to understand than one might expect), because it implements the append-only flag and the immutable flag.

以下ioctl命令是为任何文件预定义的,包括设备专用文件:

The following ioctl commands are predefined for any file, including device-special files:

FIOCLEX
FIOCLEX

设置执行时关闭标志(File IOctl CLose on EXec)。设置此标志会导致调用进程执行新程序时关闭文件描述符。

Set the close-on-exec flag (File IOctl CLose on EXec). Setting this flag causes the file descriptor to be closed when the calling process executes a new program.

FIONCLEX
FIONCLEX

清除执行时关闭标志(File IOctl Not CLose on EXec)。该命令恢复常规的文件行为,撤销上面 FIOCLEX 所做的操作。

Clear the close-on-exec flag (File IOctl Not CLose on EXec). The command restores the common file behavior, undoing what FIOCLEX above does.

FIOASYNC
FIOASYNC

设置或重置文件的异步通知(如本章后面的 6.4 节所述)。请注意,Linux 2.2.4 之前的内核版本错误地使用此命令来修改 O_SYNC 标志。由于这两个操作都可以通过 fcntl 完成,实际上没有人使用 FIOASYNC 命令,这里列出它只是为了完整性。

Set or reset asynchronous notification for the file (as discussed in the Section 6.4 later in this chapter). Note that kernel versions up to Linux 2.2.4 incorrectly used this command to modify the O_SYNC flag. Since both actions can be accomplished through fcntl, nobody actually uses the FIOASYNC command, which is reported here only for completeness.

FIOQSIZE
FIOQSIZE

该命令返回文件或目录的大小;然而,当应用于设备文件时,它会返回 ENOTTY 错误。

This command returns the size of a file or directory; when applied to a device file, however, it yields an ENOTTY error return.

FIONBIO
FIONBIO

“File IOctl 非阻塞 I/O”( 第 6.2.3 节中描述)。此调用修改O_NONBLOCK中的标志filp->f_flags。系统调用的第三个参数用于指示是否要设置或清除该标志。(我们将在本章后面讨论该标志的作用。)请注意,更改该标志的常用方法是通过 fcntl系统调用,使用 F_SETFL命令。

"File IOctl Non-Blocking I/O" (described in Section 6.2.3). This call modifies the O_NONBLOCK flag in filp->f_flags. The third argument to the system call is used to indicate whether the flag is to be set or cleared. (We'll look at the role of the flag later in this chapter.) Note that the usual way to change this flag is with the fcntl system call, using the F_SETFL command.

列表中的最后一项引入了一个新的系统调用 fcntl,它看起来很像 ioctl。事实上,fcntl 调用与 ioctl 非常相似,它同样接受一个命令参数和一个额外的(可选)参数。它与 ioctl 分开主要是出于历史原因:当 Unix 开发人员面对控制 I/O 操作的问题时,他们认为文件和设备是不同的。当时,唯一具有 ioctl 实现的设备是 tty,这也解释了为什么 -ENOTTY 是对不正确的 ioctl 命令的标准答复。情况已经改变,但 fcntl 仍然是一个单独的系统调用。

The last item in the list introduced a new system call, fcntl, which looks like ioctl. In fact, the fcntl call is very similar to ioctl in that it gets a command argument and an extra (optional) argument. It is kept separate from ioctl mainly for historical reasons: when Unix developers faced the problem of controlling I/O operations, they decided that files and devices were different. At the time, the only devices with ioctl implementations were ttys, which explains why -ENOTTY is the standard reply for an incorrect ioctl command. Things have changed, but fcntl remains a separate system call.

使用 ioctl 参数

Using the ioctl Argument

在查看 scull 驱动程序的 ioctl 代码之前,我们还需要讨论另一点:如何使用那个额外的参数。如果它是一个整数,那很简单:可以直接使用。然而,如果它是一个指针,就必须小心。

Another point we need to cover before looking at the ioctl code for the scull driver is how to use the extra argument. If it is an integer, it's easy: it can be used directly. If it is a pointer, however, some care must be taken.

当使用指针来引用用户空间时,我们必须保证用户地址是有效的。尝试访问未经验证的用户提供的指针可能会导致不正确的行为、内核错误、系统损坏或安全问题。驱动程序有责任对其使用的每个用户空间地址进行适当的检查,如果无效则返回错误。

When a pointer is used to refer to user space, we must ensure that the user address is valid. An attempt to access an unverified user-supplied pointer can lead to incorrect behavior, a kernel oops, system corruption, or security problems. It is the driver's responsibility to make proper checks on every user-space address it uses and to return an error if it is invalid.

第 3 章中,我们了解了copy_from_usercopy_to_user函数,它们可用于安全地将数据移入和移出用户空间。这些函数也可以在ioctl方法中使用,但ioctl调用通常涉及小数据项,可以通过其他方式更有效地操作这些数据项。首先,地址验证(不传输数据)由函数access_ok实现,该函数在<asm/uaccess.h>中声明:

In Chapter 3, we looked at the copy_from_user and copy_to_user functions, which can be used to safely move data to and from user space. Those functions can be used in ioctl methods as well, but ioctl calls often involve small data items that can be more efficiently manipulated through other means. To start, address verification (without transferring data) is implemented by the function access_ok, which is declared in <asm/uaccess.h>:

int access_ok(int type, const void *addr, unsigned long size);

第一个参数应该是 VERIFY_READ 或 VERIFY_WRITE,取决于要执行的操作是读取用户空间内存区域还是写入它。addr 参数保存用户空间地址,size 是字节数。例如,如果 ioctl 需要从用户空间读取一个整数值,size 就是 sizeof(int)。如果您需要在给定地址同时读取和写入,请使用 VERIFY_WRITE,因为它是 VERIFY_READ 的超集。

The first argument should be either VERIFY_READ or VERIFY_WRITE, depending on whether the action to be performed is reading the user-space memory area or writing it. The addr argument holds a user-space address, and size is a byte count. If ioctl, for instance, needs to read an integer value from user space, size is sizeof(int). If you need to both read and write at the given address, use VERIFY_WRITE, since it is a superset of VERIFY_READ.

与大多数内核函数不同,access_ok 返回一个布尔值:1 表示成功(访问没有问题),0 表示失败(访问有问题)。如果它返回假,驱动程序通常应向调用者返回 -EFAULT。

Unlike most kernel functions, access_ok returns a boolean value: 1 for success (access is OK) and 0 for failure (access is not OK). If it returns false, the driver should usually return -EFAULT to the caller.

关于access_ok有一些有趣的事情需要注意。首先,它没有完成验证内存访问的完整工作;它仅检查内存引用是否位于进程可能合理访问的内存区域中。特别是,access_ok确保该地址不指向内核空间内存。其次,大多数驱动程序代码实际上不需要调用 access_ok。稍后描述的内存访问例程会为您处理这个问题。尽管如此,我们还是演示了它的用法,以便您可以了解它是如何完成的。

There are a couple of interesting things to note about access_ok. First, it does not do the complete job of verifying memory access; it only checks to see that the memory reference is in a region of memory that the process might reasonably have access to. In particular, access_ok ensures that the address does not point to kernel-space memory. Second, most driver code need not actually call access_ok. The memory-access routines described later take care of that for you. Nonetheless, we demonstrate its use so that you can see how it is done.

scull 源代码利用 ioctl 编号中的位字段,在进入 switch 之前检查参数:

The scull source exploits the bitfields in the ioctl number to check the arguments before the switch:

int err = 0, tmp;
int retval = 0;

/*
 * extract the type and number bitfields, and don't decode
 * wrong cmds: return ENOTTY (inappropriate ioctl) before access_ok()
 */
if (_IOC_TYPE(cmd) != SCULL_IOC_MAGIC) return -ENOTTY;
if (_IOC_NR(cmd) > SCULL_IOC_MAXNR) return -ENOTTY;

/*
 * the direction is a bitmask, and VERIFY_WRITE catches R/W
 * transfers. `Type' is user-oriented, while
 * access_ok is kernel-oriented, so the concept of "read" and
 * "write" is reversed
 */
if (_IOC_DIR(cmd) & _IOC_READ)
    err = !access_ok(VERIFY_WRITE, (void __user *)arg, _IOC_SIZE(cmd));
else if (_IOC_DIR(cmd) & _IOC_WRITE)
    err = !access_ok(VERIFY_READ, (void __user *)arg, _IOC_SIZE(cmd));
if (err) return -EFAULT;

调用access_ok后,驱动程序可以安全地执行实际传输。除了copy_from_usercopy_to_user函数之外,程序员还可以利用一组针对最常用的数据大小(一个、两个、四个和八个字节)进行优化的函数。这些函数在下面的列表中描述并在<asm/uaccess.h>中定义:

After calling access_ok, the driver can safely perform the actual transfer. In addition to the copy_from_user and copy_to_user functions, the programmer can exploit a set of functions that are optimized for the most used data sizes (one, two, four, and eight bytes). These functions are described in the following list and are defined in <asm/uaccess.h>:

put_user(datum, ptr)

__put_user(datum, ptr)
put_user(datum, ptr)

__put_user(datum, ptr)

这些宏将数据写入用户空间;它们相对较快,在传输单个值时应调用它们而不是 copy_to_user。这些宏在编写时允许向 put_user 传递任意类型的指针,只要它是用户空间地址即可。数据传输的大小取决于 ptr 参数的类型,并在编译时使用 sizeof 和 typeof 编译器内建机制来确定。因此,如果 ptr 是一个 char 指针,就传输一个字节,以此类推,还可以传输两个、四个,甚至可能是八个字节。

These macros write the datum to user space; they are relatively fast and should be called instead of copy_to_user whenever single values are being transferred. The macros have been written to allow the passing of any type of pointer to put_user, as long as it is a user-space address. The size of the data transfer depends on the type of the ptr argument and is determined at compile time using the sizeof and typeof compiler builtins. As a result, if ptr is a char pointer, one byte is transferred, and so on for two, four, and possibly eight bytes.

put_user 会检查以确保进程能够写入给定的内存地址。它在成功时返回 0,出错时返回 -EFAULT。__put_user 执行的检查较少(它不调用 access_ok),但如果所指向的内存对用户不可写,它仍然可能失败。因此,只有当内存区域已经用 access_ok 验证过时,才应该使用 __put_user。

put_user checks to ensure that the process is able to write to the given memory address. It returns 0 on success, and -EFAULT on error. __put_user performs less checking (it does not call access_ok), but can still fail if the memory pointed to is not writable by the user. Thus, __put_user should only be used if the memory region has already been verified with access_ok.

作为一般规则,当您在实现 read 方法时,或者当您复制多个数据项、因而只需在第一次数据传输之前调用一次 access_ok 时(如上面的 ioctl 所示),可以调用 __put_user 来节省一些周期。

As a general rule, you call __put_user to save a few cycles when you are implementing a read method, or when you copy several items and, thus, call access_ok just once before the first data transfer, as shown above for ioctl.

get_user(local, ptr)

__get_user(local, ptr)
get_user(local, ptr)

__get_user(local, ptr)

这些宏用于从用户空间取回单个数据。它们的行为类似于 put_user 和 __put_user,但沿相反方向传输数据。取回的值存储在局部变量 local 中;返回值表明操作是否成功。同样,只有当地址已经用 access_ok 验证过时,才应该使用 __get_user。

These macros are used to retrieve a single datum from user space. They behave like put_user and __put_user, but transfer data in the opposite direction. The value retrieved is stored in the local variable local; the return value indicates whether the operation succeeded. Again, __get_user should only be used if the address has already been verified with access_ok.

如果试图使用上面列出的某个函数来传输大小不符合其中任何一种特定大小的值,其结果通常是编译器给出的一条奇怪消息,例如“conversion to non-scalar type requested”。在这种情况下,必须使用 copy_to_user 或 copy_from_user。

If an attempt is made to use one of the listed functions to transfer a value that does not fit one of the specific sizes, the result is usually a strange message from the compiler, such as "conversion to non-scalar type requested." In such cases, copy_to_user or copy_from_user must be used.

能力和受限操作

Capabilities and Restricted Operations

对设备的访问由设备文件的权限控制,驱动程序通常不参与权限检查。然而,在某些情况下,任何用户都被授予对设备的读/写权限,但某些控制操作仍应被拒绝。例如,并非磁带驱动器的所有用户都应该能够设置其默认块大小,并且已被授予对磁盘设备的读/写访问权限的用户可能仍应被拒绝格式化该设备的能力。在此类情况下,驱动程序必须执行额外的检查以确保用户能够执行请求的操作。

Access to a device is controlled by the permissions on the device file(s), and the driver is not normally involved in permissions checking. There are situations, however, where any user is granted read/write permission on the device, but some control operations should still be denied. For example, not all users of a tape drive should be able to set its default block size, and a user who has been granted read/write access to a disk device should probably still be denied the ability to format it. In cases like these, the driver must perform additional checks to be sure that the user is capable of performing the requested operation.

Unix 系统传统上把特权操作限制给超级用户账户。这意味着特权是一件全有或全无的事情——超级用户绝对可以做任何事情,而所有其他用户都受到严格限制。Linux 内核提供了一个更灵活的系统,称为能力(capability)。基于能力的系统抛弃了全有或全无的模式,把特权操作分解为单独的子组。这样,就可以授权特定用户(或程序)执行某个特定的特权操作,而不必同时给予执行其他无关操作的能力。内核在权限管理中完全使用能力,并导出了 capget 和 capset 两个系统调用,以便从用户空间管理它们。

Unix systems have traditionally restricted privileged operations to the superuser account. This meant that privilege was an all-or-nothing thing—the superuser can do absolutely anything, but all other users are highly restricted. The Linux kernel provides a more flexible system called capabilities. A capability-based system leaves the all-or-nothing mode behind and breaks down privileged operations into separate subgroups. In this way, a particular user (or program) can be empowered to perform a specific privileged operation without giving away the ability to perform other, unrelated operations. The kernel uses capabilities exclusively for permissions management and exports two system calls, capget and capset, to allow them to be managed from user space.

完整的能力集合可以在 <linux/capability.h> 中找到。这些是系统已知的仅有的能力;驱动程序作者或系统管理员不可能在不修改内核源代码的情况下定义新的能力。设备驱动程序编写者可能感兴趣的能力子集包括以下内容:

The full set of capabilities can be found in <linux/capability.h>. These are the only capabilities known to the system; it is not possible for driver authors or system administrators to define new ones without modifying the kernel source. A subset of those capabilities that might be of interest to device driver writers includes the following:

CAP_DAC_OVERRIDE
CAP_DAC_OVERRIDE

能够覆盖文件和目录的访问限制(数据访问控制或 DAC)。

The ability to override access restrictions (data access control, or DAC) on files and directories.

CAP_NET_ADMIN
CAP_NET_ADMIN

执行网络管理任务的能力,包括影响网络接口的任务。

The ability to perform network administration tasks, including those that affect network interfaces.

CAP_SYS_MODULE
CAP_SYS_MODULE

加载或删除内核模块的能力。

The ability to load or remove kernel modules.

CAP_SYS_RAWIO
CAP_SYS_RAWIO

执行“原始”I/O 操作的能力。示例包括访问设备端口或直接与 USB 设备通信。

The ability to perform "raw" I/O operations. Examples include accessing device ports or communicating directly with USB devices.

CAP_SYS_ADMIN
CAP_SYS_ADMIN

一种包罗万象的功能,提供对许多系统管理操作的访问。

A catch-all capability that provides access to many system administration operations.

CAP_SYS_TTY_CONFIG
CAP_SYS_TTY_CONFIG

执行 tty 配置任务的能力。

The ability to perform tty configuration tasks.

在执行特权操作之前,设备驱动程序应检查调用进程是否具有适当的能力;如果不这样做,可能会导致用户进程执行未经授权的操作,给系统稳定性或安全性带来不良后果。能力检查是通过 capable 函数(在 <linux/sched.h> 中定义)执行的:

Before performing a privileged operation, a device driver should check that the calling process has the appropriate capability; failure to do so could result in user processes performing unauthorized operations with bad results on system stability or security. Capability checks are performed with the capable function (defined in <linux/sched.h>):

int capable(int capability);

scull示例驱动程序中,任何用户都可以查询量子和量子集大小。然而,只有特权用户才可以更改这些值,因为不适当的值可能会严重影响系统性能。当需要时, ioctl的scull实现会检查用户的权限级别,如下所示:

In the scull sample driver, any user is allowed to query the quantum and quantum set sizes. Only privileged users, however, may change those values, since inappropriate values could badly affect system performance. When needed, the scull implementation of ioctl checks a user's privilege level as follows:

if (!capable(CAP_SYS_ADMIN))
    return -EPERM;

由于没有更具体的能力来完成此任务,因此CAP_SYS_ADMIN被选择进行此测试。

In the absence of a more specific capability for this task, CAP_SYS_ADMIN was chosen for this test.

ioctl命令的实现

The Implementation of the ioctl Commands

ioctl 的 scull 实现只传输设备的可配置参数,结果非常简单,如下所示:

The scull implementation of ioctl only transfers the configurable parameters of the device and turns out to be as easy as the following:

switch(cmd) {

  case SCULL_IOCRESET:
    scull_quantum = SCULL_QUANTUM;
    scull_qset = SCULL_QSET;
    break;

  case SCULL_IOCSQUANTUM: /* Set: arg points to the value */
    if (!capable(CAP_SYS_ADMIN))
        return -EPERM;
    retval = __get_user(scull_quantum, (int __user *)arg);
    break;

  case SCULL_IOCTQUANTUM: /* Tell: arg is the value */
    if (!capable(CAP_SYS_ADMIN))
        return -EPERM;
    scull_quantum = arg;
    break;

  case SCULL_IOCGQUANTUM: /* Get: arg is pointer to result */
    retval = __put_user(scull_quantum, (int __user *)arg);
    break;

  case SCULL_IOCQQUANTUM: /* Query: return it (it's positive) */
    return scull_quantum;

  case SCULL_IOCXQUANTUM: /* eXchange: use arg as pointer */
    if (!capable(CAP_SYS_ADMIN))
        return -EPERM;
    tmp = scull_quantum;
    retval = __get_user(scull_quantum, (int __user *)arg);
    if (retval == 0)
        retval = __put_user(tmp, (int __user *)arg);
    break;

  case SCULL_IOCHQUANTUM: /* sHift: like Tell + Query */
    if (!capable(CAP_SYS_ADMIN))
        return -EPERM;
    tmp = scull_quantum;
    scull_quantum = arg;
    return tmp;

  default:  /* redundant, as cmd was checked against MAXNR */
    return -ENOTTY;
}
return retval;

scull 还包括六个作用于 scull_qset 的条目。这些条目与作用于 scull_quantum 的条目完全相同,不值得在书中展示。

scull also includes six entries that act on scull_qset. These entries are identical to the ones for scull_quantum and are not worth showing in print.

从调用者的角度(即从用户空间)来看,传递和接收参数的六种方法如下所示:

The six ways to pass and receive arguments look like the following from the caller's point of view (i.e., from user space):

int quantum;

ioctl(fd,SCULL_IOCSQUANTUM, &quantum);          /* Set by pointer */
ioctl(fd,SCULL_IOCTQUANTUM, quantum);           /* Set by value */

ioctl(fd,SCULL_IOCGQUANTUM, &quantum);          /* Get by pointer */
quantum = ioctl(fd,SCULL_IOCQQUANTUM);          /* Get by return value */

ioctl(fd,SCULL_IOCXQUANTUM, &quantum);          /* Exchange by pointer */
quantum = ioctl(fd,SCULL_IOCHQUANTUM, quantum); /* Exchange by value */

当然,普通的驱动程序不会实现这种混合的调用模式。我们在这里这样做只是为了演示完成任务的不同方式。然而,通常情况下,数据交换将通过指针或按值一致地执行,并且将避免两种技术的混合。

Of course, a normal driver would not implement such a mix of calling modes. We have done so here only to demonstrate the different ways in which things could be done. Normally, however, data exchanges would be consistently performed, either through pointers or by value, and mixing of the two techniques would be avoided.

不使用 ioctl 进行设备控制

Device Control Without ioctl

有时,通过将控制序列写入设备本身可以更好地控制设备。例如,此技术用于控制台驱动程序,其中所谓的转义序列用于移动光标、更改默认颜色或执行其他配置任务。以这种方式实现设备控制的好处是,用户只需写入数据即可控制设备,而无需使用(或有时编写)专门为配置设备而构建的程序。当可以以这种方式控制设备时,发出命令的程序通常甚至不需要在与其控制的设备相同的系统上运行。

Sometimes controlling the device is better accomplished by writing control sequences to the device itself. For example, this technique is used in the console driver, where so-called escape sequences are used to move the cursor, change the default color, or perform other configuration tasks. The benefit of implementing device control this way is that the user can control the device just by writing data, without needing to use (or sometimes write) programs built just for configuring the device. When devices can be controlled in this manner, the program issuing commands often need not even be running on the same system as the device it is controlling.

例如, setterm程序通过打印转义序列来作用于控制台(或另一个终端)配置。控制程序可以与受控设备驻留在不同的计算机上,因为简单的数据流重定向即可完成配置工作。这是每次运行远程 tty 会话时发生的情况:转义序列被远程打印,但影响本地 tty;不过,该技术并不局限于 tty。

For example, the setterm program acts on the console (or another terminal) configuration by printing escape sequences. The controlling program can live on a different computer from the controlled device, because a simple redirection of the data stream does the configuration job. This is what happens every time you run a remote tty session: escape sequences are printed remotely but affect the local tty; the technique is not restricted to ttys, though.

通过打印来控制的缺点是它给设备增加了策略约束;例如,只有当您确信控制序列不会出现在正常操作期间写入设备的数据中时,它才是可行的。对于 tty 来说,这只是部分成立。尽管文本显示本应只显示 ASCII 字符,但有时控制字符会混入正在写入的数据中,从而影响控制台设置。例如,当您把一个二进制文件 cat 到屏幕上时就会发生这种情况;由此产生的混乱可能包含任何内容,您常常会发现控制台上的字体变得不对了。

The drawback of controlling by printing is that it adds policy constraints to the device; for example, it is viable only if you are sure that the control sequence can't appear in the data being written to the device during normal operation. This is only partly true for ttys. Although a text display is meant to display only ASCII characters, sometimes control characters can slip through in the data being written and can, therefore, affect the console setup. This can happen, for example, when you cat a binary file to the screen; the resulting mess can contain anything, and you often end up with the wrong font on your console.

Controlling by write is definitely the way to go for those devices that don't transfer data but just respond to commands, such as robotic devices.

For instance, a driver written for fun by one of your authors moves a camera on two axes. In this driver, the "device" is simply a pair of old stepper motors, which can't really be read from or written to. The concept of "sending a data stream" to a stepper motor makes little or no sense. In this case, the driver interprets what is being written as ASCII commands and converts the requests to sequences of impulses that manipulate the stepper motors. The idea is similar, somewhat, to the AT commands you send to the modem in order to set up communication, the main difference being that the serial port used to communicate with the modem must transfer real data as well. The advantage of direct device control is that you can use cat to move the camera without writing and compiling special code to issue the ioctl calls.

When writing command-oriented drivers, there's no reason to implement the ioctl method. An additional command in the interpreter is easier to implement and use.

Sometimes, though, you might choose to act the other way around: instead of turning the write method into an interpreter and avoiding ioctl, you might choose to avoid write altogether and use ioctl commands exclusively, while accompanying the driver with a specific command-line tool to send those commands to the driver. This approach moves the complexity from kernel space to user space, where it may be easier to deal with, and helps keep the driver small while denying use of simple cat or echo commands.

Blocking I/O

Back in Chapter 3, we looked at how to implement the read and write driver methods. At that point, however, we skipped over one important issue: how does a driver respond if it cannot immediately satisfy the request? A call to read may come when no data is available, but more is expected in the future. Or a process could attempt to write, but your device is not ready to accept the data, because your output buffer is full. The calling process usually does not care about such issues; the programmer simply expects to call read or write and have the call return after the necessary work has been done. So, in such cases, your driver should (by default) block the process, putting it to sleep until the request can proceed.

This section shows how to put a process to sleep and wake it up again later on. As usual, however, we have to explain a few concepts first.

Introduction to Sleeping

What does it mean for a process to "sleep"? When a process is put to sleep, it is marked as being in a special state and removed from the scheduler's run queue. Until something comes along to change that state, the process will not be scheduled on any CPU and, therefore, will not run. A sleeping process has been shunted off to the side of the system, waiting for some future event to happen.

Causing a process to sleep is an easy thing for a Linux device driver to do. There are, however, a couple of rules that you must keep in mind to be able to code sleeps in a safe manner.

The first of these rules is: never sleep when you are running in an atomic context. An atomic context is simply a state where multiple steps must be performed without any sort of concurrent access. What that means, with regard to sleeping, is that your driver cannot sleep while holding a spinlock, seqlock, or RCU lock. You also cannot sleep if you have disabled interrupts. It is legal to sleep while holding a semaphore, but you should look very carefully at any code that does so. If code sleeps while holding a semaphore, any other thread waiting for that semaphore also sleeps. So any sleeps that happen while holding semaphores should be short, and you should convince yourself that, by holding the semaphore, you are not blocking the process that will eventually wake you up.

Another thing to remember with sleeping is that, when you wake up, you never know how long your process may have been out of the CPU or what may have changed in the meantime. You also do not usually know if another process may have been sleeping for the same event; that process may wake before you and grab whatever resource you were waiting for. The end result is that you can make no assumptions about the state of the system after you wake up, and you must check to ensure that the condition you were waiting for is, indeed, true.

One other relevant point, of course, is that your process cannot sleep unless it is assured that somebody else, somewhere, will wake it up. The code doing the awakening must also be able to find your process to be able to do its job. Making sure that a wakeup happens is a matter of thinking through your code and knowing, for each sleep, exactly what series of events will bring that sleep to an end. Making it possible for your sleeping process to be found is, instead, accomplished through a data structure called a wait queue . A wait queue is just what it sounds like: a list of processes, all waiting for a specific event.

In Linux, a wait queue is managed by means of a "wait queue head," a structure of type wait_queue_head_t, which is defined in <linux/wait.h>. A wait queue head can be defined and initialized statically with:

DECLARE_WAIT_QUEUE_HEAD(name);

or dynamically as follows:

wait_queue_head_t my_queue;
init_waitqueue_head(&my_queue);

We will return to the structure of wait queues shortly, but we know enough now to take a first look at sleeping and waking up.

Simple Sleeping

When a process sleeps, it does so in expectation that some condition will become true in the future. As we noted before, any process that sleeps must check to be sure that the condition it was waiting for is really true when it wakes up again. The simplest way of sleeping in the Linux kernel is a macro called wait_event (with a few variants); it combines handling the details of sleeping with a check on the condition a process is waiting for. The forms of wait_event are:

wait_event(queue, condition)
wait_event_interruptible(queue, condition)
wait_event_timeout(queue, condition, timeout)
wait_event_interruptible_timeout(queue, condition, timeout)

In all of the above forms, queue is the wait queue head to use. Notice that it is passed "by value." The condition is an arbitrary boolean expression that is evaluated by the macro before and after sleeping; until condition evaluates to a true value, the process continues to sleep. Note that condition may be evaluated an arbitrary number of times, so it should not have any side effects.

If you use wait_event, your process is put into an uninterruptible sleep which, as we have mentioned before, is usually not what you want. The preferred alternative is wait_event_interruptible, which can be interrupted by signals. This version returns an integer value that you should check; a nonzero value means your sleep was interrupted by some sort of signal, and your driver should probably return -ERESTARTSYS. The final versions (wait_event_timeout and wait_event_interruptible_timeout) wait for a limited time; after that time period (expressed in jiffies, which we will discuss in Chapter 7) expires, the macros return with a value of 0 regardless of how condition evaluates.

The other half of the picture, of course, is waking up. Some other thread of execution (a different process, or an interrupt handler, perhaps) has to perform the wakeup for you, since your process is, of course, asleep. The basic function that wakes up sleeping processes is called wake_up . It comes in several forms (but we look at only two of them now):

void wake_up(wait_queue_head_t *queue);
void wake_up_interruptible(wait_queue_head_t *queue);

wake_up wakes up all processes waiting on the given queue (though the situation is a little more complicated than that, as we will see later). The other form (wake_up_interruptible) restricts itself to processes performing an interruptible sleep. In general, the two are indistinguishable (if you are using interruptible sleeps); in practice, the convention is to use wake_up if you are using wait_event and wake_up_interruptible if you use wait_event_interruptible.

We now know enough to look at a simple example of sleeping and waking up. In the sample source, you can find a module called sleepy. It implements a device with simple behavior: any process that attempts to read from the device is put to sleep. Whenever a process writes to the device, all sleeping processes are awakened. This behavior is implemented with the following read and write methods:

static DECLARE_WAIT_QUEUE_HEAD(wq);
static int flag = 0;

ssize_t sleepy_read (struct file *filp, char __user *buf, size_t count, loff_t *pos)
{
    printk(KERN_DEBUG "process %i (%s) going to sleep\n",
            current->pid, current->comm);
    wait_event_interruptible(wq, flag != 0);
    flag = 0;
    printk(KERN_DEBUG "awoken %i (%s)\n", current->pid, current->comm);
    return 0; /* EOF */
}

ssize_t sleepy_write (struct file *filp, const char __user *buf, size_t count,
        loff_t *pos)
{
    printk(KERN_DEBUG "process %i (%s) awakening the readers...\n",
            current->pid, current->comm);
    flag = 1;
    wake_up_interruptible(&wq);
    return count; /* succeed, to avoid retrial */
}

Note the use of the flag variable in this example. Since wait_event_interruptible checks for a condition that must become true, we use flag to create that condition.

It is interesting to consider what happens if two processes are waiting when sleepy_write is called. Since sleepy_read resets flag to 0 once it wakes up, you might think that the second process to wake up would immediately go back to sleep. On a single-processor system, that is almost always what happens. But it is important to understand why you cannot count on that behavior. The wake_up_interruptible call will cause both sleeping processes to wake up. It is entirely possible that they will both note that flag is nonzero before either has the opportunity to reset it. For this trivial module, this race condition is unimportant. In a real driver, this kind of race can create rare crashes that are difficult to diagnose. If correct operation required that exactly one process see the nonzero value, it would have to be tested in an atomic manner. We will see how a real driver handles such situations shortly. But first we have to cover one other topic.

Blocking and Nonblocking Operations

One last point we need to touch on before we look at the implementation of full-featured read and write methods is deciding when to put a process to sleep. There are times when implementing proper Unix semantics requires that an operation not block, even if it cannot be completely carried out.

There are also times when the calling process informs you that it does not want to block, whether or not its I/O can make any progress at all. Explicitly nonblocking I/O is indicated by the O_NONBLOCK flag in filp->f_flags. The flag is defined in <linux/fcntl.h>, which is automatically included by <linux/fs.h>. The flag gets its name from "open-nonblock," because it can be specified at open time (and originally could be specified only there). If you browse the source code, you find some references to an O_NDELAY flag; this is an alternate name for O_NONBLOCK, accepted for compatibility with System V code. The flag is cleared by default, because the normal behavior of a process waiting for data is just to sleep. In the case of a blocking operation, which is the default, the following behavior should be implemented in order to adhere to the standard semantics:

  • If a process calls read but no data is (yet) available, the process must block. The process is awakened as soon as some data arrives, and that data is returned to the caller, even if there is less than the amount requested in the count argument to the method.

  • If a process calls write and there is no space in the buffer, the process must block, and it must be on a different wait queue from the one used for reading. When some data has been written to the hardware device, and space becomes free in the output buffer, the process is awakened and the write call succeeds, although the data may be only partially written if there isn't room in the buffer for the count bytes that were requested.

Both these statements assume that there are both input and output buffers; in practice, almost every device driver has them. The input buffer is required to avoid losing data that arrives when nobody is reading. In contrast, data can't be lost on write, because if the system call doesn't accept data bytes, they remain in the user-space buffer. Even so, the output buffer is almost always useful for squeezing more performance out of the hardware.

The performance gain of implementing an output buffer in the driver results from the reduced number of context switches and user-level/kernel-level transitions. Without an output buffer (assuming a slow device), only one or a few characters are accepted by each system call, and while one process sleeps in write, another process runs (that's one context switch). When the first process is awakened, it resumes (another context switch), write returns (kernel/user transition), and the process reiterates the system call to write more data (user/kernel transition); the call blocks and the loop continues. The addition of an output buffer allows the driver to accept larger chunks of data with each write call, with a corresponding increase in performance. If that buffer is big enough, the write call succeeds on the first attempt—the buffered data will be pushed out to the device later—without control needing to go back to user space for a second or third write call. The choice of a suitable size for the output buffer is clearly device-specific.

We don't use an input buffer in scull, because data is already available when read is issued. Similarly, no output buffer is used, because data is simply copied to the memory area associated with the device. Essentially, the device is a buffer, so the implementation of additional buffers would be superfluous. We'll see the use of buffers in Chapter 10.

The behavior of read and write is different if O_NONBLOCK is specified. In this case, the calls simply return -EAGAIN ("try it again") if a process calls read when no data is available or if it calls write when there's no space in the buffer.

As you might expect, nonblocking operations return immediately, allowing the application to poll for data. Applications must be careful when using the stdio functions while dealing with nonblocking files, because they can easily mistake a nonblocking return for EOF. They always have to check errno.

Naturally, O_NONBLOCK is meaningful in the open method also. This happens when the call can actually block for a long time; for example, when opening (for read access) a FIFO that has no writers (yet), or accessing a disk file with a pending lock. Usually, opening a device either succeeds or fails, without the need to wait for external events. Sometimes, however, opening the device requires a long initialization, and you may choose to support O_NONBLOCK in your open method by returning immediately with -EAGAIN if the flag is set, after starting the device initialization process. The driver may also implement a blocking open to support access policies in a way similar to file locks. We'll see one such implementation in Section 6.6.3 later in this chapter.

Some drivers may also implement special semantics for O_NONBLOCK; for example, an open of a tape device usually blocks until a tape has been inserted. If the tape drive is opened with O_NONBLOCK, the open succeeds immediately regardless of whether the media is present or not.

Only the read, write, and open file operations are affected by the nonblocking flag.

A Blocking I/O Example

Finally, we get to an example of a real driver method that implements blocking I/O. This example is taken from the scullpipe driver; it is a special form of scull that implements a pipe-like device.

Within a driver, a process blocked in a read call is awakened when data arrives; usually the hardware issues an interrupt to signal such an event, and the driver awakens waiting processes as part of handling the interrupt. The scullpipe driver works differently, so that it can be run without requiring any particular hardware or an interrupt handler. We chose to use another process to generate the data and wake the reading process; similarly, reading processes are used to wake writer processes that are waiting for buffer space to become available.

The device driver uses a device structure that contains two wait queues and a buffer. The size of the buffer is configurable in the usual ways (at compile time, load time, or runtime).

struct scull_pipe {
        wait_queue_head_t inq, outq;       /* read and write queues */
        char *buffer, *end;                /* begin of buf, end of buf */
        int buffersize;                    /* used in pointer arithmetic */
        char *rp, *wp;                     /* where to read, where to write */
        int nreaders, nwriters;            /* number of openings for r/w */
        struct fasync_struct *async_queue; /* asynchronous readers */
        struct semaphore sem;              /* mutual exclusion semaphore */
        struct cdev cdev;                  /* Char device structure */
};

The read implementation manages both blocking and nonblocking input and looks like this:

static ssize_t scull_p_read (struct file *filp, char __user *buf, size_t count,
                loff_t *f_pos)
{
    struct scull_pipe *dev = filp->private_data;

    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;

    while (dev->rp == dev->wp) { /* nothing to read */
        up(&dev->sem); /* release the lock */
        if (filp->f_flags & O_NONBLOCK)
            return -EAGAIN;
        PDEBUG("\"%s\" reading: going to sleep\n", current->comm);
        if (wait_event_interruptible(dev->inq, (dev->rp != dev->wp)))
            return -ERESTARTSYS; /* signal: tell the fs layer to handle it */
        /* otherwise loop, but first reacquire the lock */
        if (down_interruptible(&dev->sem))
            return -ERESTARTSYS;
    }
    /* ok, data is there, return something */
    if (dev->wp > dev->rp)
        count = min(count, (size_t)(dev->wp - dev->rp));
    else /* the write pointer has wrapped, return data up to dev->end */
        count = min(count, (size_t)(dev->end - dev->rp));
    if (copy_to_user(buf, dev->rp, count)) {
        up (&dev->sem);
        return -EFAULT;
    }
    dev->rp += count;
    if (dev->rp == dev->end)
        dev->rp = dev->buffer; /* wrapped */
    up (&dev->sem);

    /* finally, awake any writers and return */
    wake_up_interruptible(&dev->outq);
    PDEBUG("\"%s\" did read %li bytes\n",current->comm, (long)count);
    return count;
}

As you can see, we left some PDEBUG statements in the code. When you compile the driver, you can enable messaging to make it easier to follow the interaction of different processes.

Let us look carefully at how scull_p_read handles waiting for data. The while loop tests the buffer with the device semaphore held. If there is data there, we know we can return it to the user immediately without sleeping, so the entire body of the loop is skipped. If, instead, the buffer is empty, we must sleep. Before we can do that, however, we must drop the device semaphore; if we were to sleep holding it, no writer would ever have the opportunity to wake us up. Once the semaphore has been dropped, we make a quick check to see if the user has requested non-blocking I/O, and return if so. Otherwise, it is time to call wait_event_interruptible.

Once we get past that call, something has woken us up, but we do not know what. One possibility is that the process received a signal. The if statement that contains the wait_event_interruptible call checks for this case. This statement ensures the proper and expected reaction to signals, which could have been responsible for waking up the process (since we were in an interruptible sleep). If a signal has arrived and it has not been blocked by the process, the proper behavior is to let upper layers of the kernel handle the event. To this end, the driver returns -ERESTARTSYS to the caller; this value is used internally by the virtual filesystem (VFS) layer, which either restarts the system call or returns -EINTR to user space. We use the same type of check to deal with signal handling for every read and write implementation.

However, even in the absence of a signal, we do not yet know for sure that there is data there for the taking. Somebody else could have been waiting for data as well, and they might win the race and get the data first. So we must acquire the device semaphore again; only then can we test the read buffer again (in the while loop) and truly know that we can return the data in the buffer to the user. The end result of all this code is that, when we exit from the while loop, we know that the semaphore is held and the buffer contains data that we can use.

Just for completeness, let us note that scull_p_read can sleep in another spot after we take the device semaphore: the call to copy_to_user. If scull sleeps while copying data between kernel and user space, it sleeps with the device semaphore held. Holding the semaphore in this case is justified since it does not deadlock the system (we know that the kernel will perform the copy to user space and wake us up without trying to lock the same semaphore in the process), and since it is important that the device memory array not change while the driver sleeps.

Advanced Sleeping

Many drivers are able to meet their sleeping requirements with the functions we have covered so far. There are situations, however, that call for a deeper understanding of how the Linux wait queue mechanism works. Complex locking or performance requirements can force a driver to use lower-level functions to effect a sleep. In this section, we look at the lower level to get an understanding of what is really going on when a process sleeps.

How a process sleeps

If you look inside <linux/wait.h>, you see that the data structure behind the wait_queue_head_t type is quite simple; it consists of a spinlock and a linked list. What goes on to that list is a wait queue entry, which is declared with the type wait_queue_t. This structure contains information about the sleeping process and exactly how it would like to be woken up.

The first step in putting a process to sleep is usually the allocation and initialization of a wait_queue_t structure, followed by its addition to the proper wait queue. When everything is in place, whoever is charged with doing the wakeup will be able to find the right processes.

The next step is to set the state of the process to mark it as being asleep. There are several task states defined in <linux/sched.h>. TASK_RUNNING means that the process is able to run, although it is not necessarily executing in the processor at any specific moment. There are two states that indicate that a process is asleep: TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE; they correspond, of course, to the two types of sleep. The other states are not normally of concern to driver writers.

在 2.6 内核中,驱动程序代码通常不需要直接操作进程状态。但是,如果您需要这样做,则使用的调用是:

In the 2.6 kernel, it is not normally necessary for driver code to manipulate the process state directly. However, should you need to do so, the call to use is:

void set_current_state(int new_state);
void set_current_state(int new_state);

在较旧的代码中,您经常会看到类似这样的内容:

In older code, you often see something like this instead:

current->state = TASK_INTERRUPTIBLE;
current->state = TASK_INTERRUPTIBLE;

但是,不鼓励以这种方式直接更改current;当数据结构改变时,这样的代码很容易被破坏。然而,上面的代码确实表明,更改进程的当前状态本身并不会使它进入睡眠状态。通过更改当前状态,您已经更改了调度程序处理进程的方式,但尚未让出处理器。

But changing current directly in that manner is discouraged; such code breaks easily when data structures change. The above code does show, however, that changing the current state of a process does not, by itself, put it to sleep. By changing the current state, you have changed the way the scheduler treats a process, but you have not yet yielded the processor.

放弃处理器是最后一步,但在此之前还有一件事要做:您必须首先检查您正在等待的条件。如果不进行此检查,就会引入竞争条件;如果在您执行上述过程时条件已经成立,而其他某个线程刚好试图唤醒您,会发生什么情况?您可能会完全错过那次唤醒,睡得比预期更久。因此,在执行休眠的代码内部,您通常会看到如下内容:

Giving up the processor is the final step, but there is one thing to do first: you must check the condition you are sleeping for. Failure to do this check invites a race condition; what happens if the condition came true while you were engaged in the above process, and some other thread has just tried to wake you up? You could miss the wakeup altogether and sleep longer than you had intended. Consequently, down inside code that sleeps, you typically see something such as:

if (!condition)
    schedule();
if (!condition)
    schedule();

通过在设置进程状态后检查我们的条件,我们可以应对所有可能的事件序列。如果我们正在等待的条件在设置进程状态之前出现,我们会在此检查中注意到,而不是真正休眠。如果此后发生唤醒,则无论我们是否实际上已经进入睡眠状态,该进程都可以运行。

By checking our condition after setting the process state, we are covered against all possible sequences of events. If the condition we are waiting for had come about before setting the process state, we notice in this check and not actually sleep. If the wakeup happens thereafter, the process is made runnable whether or not we have actually gone to sleep yet.

当然,对schedule的调用是调用调度器并让出 CPU 的方法。每当您调用此函数时,您都在告诉内核考虑应该运行哪个进程,并在必要时将控制权切换到该进程。所以您永远不知道schedule返回到您的代码之前需要多长时间。

The call to schedule is, of course, the way to invoke the scheduler and yield the CPU. Whenever you call this function, you are telling the kernel to consider which process should be running and to switch control to that process if necessary. So you never know how long it will be before schedule returns to your code.

if测试以及可能的schedule调用(及其返回)之后,需要进行一些清理工作。由于代码不再打算休眠,因此必须确保任务状态重置为TASK_RUNNING。如果代码刚从schedule返回,则不需要这一步;在进程处于可运行状态之前,该函数不会返回。但是,如果由于不再需要休眠而跳过了对schedule的调用,则进程状态将不正确。还需要将进程从等待队列中移除,否则它可能会被多次唤醒。

After the if test and possible call to (and return from) schedule, there is some cleanup to be done. Since the code no longer intends to sleep, it must ensure that the task state is reset to TASK_RUNNING. If the code just returned from schedule, this step is unnecessary; that function does not return until the process is in a runnable state. But if the call to schedule was skipped because it was no longer necessary to sleep, the process state will be incorrect. It is also necessary to remove the process from the wait queue, or it may be awakened more than once.
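把上述步骤串起来,一次完整的手工睡眠大致如下(示意代码;queue 和 condition 是假设的名字,并非取自 scull):

Putting the above steps together, a complete hand-coded sleep looks roughly like this (a sketch; queue and condition are hypothetical names, not taken from scull):

```c
DECLARE_WAITQUEUE(wait, current);      /* 分配并初始化等待队列条目 / allocate and initialize the entry */

add_wait_queue(&queue, &wait);         /* 加入等待队列 / add to the wait queue */
set_current_state(TASK_INTERRUPTIBLE); /* 标记为睡眠状态 / mark the process as asleep */
if (!condition)                        /* 设置状态之后再检查条件 / check the condition after setting the state */
    schedule();                        /* 让出处理器 / yield the processor */
set_current_state(TASK_RUNNING);       /* 若跳过了 schedule,恢复状态 / reset state in case schedule was skipped */
remove_wait_queue(&queue, &wait);      /* 从等待队列中移除 / remove from the wait queue */
```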

手动睡眠

Manual sleeps

在 Linux 内核的早期版本中,非平凡的睡眠需要程序员手动处理上述所有步骤。这是一个乏味的过程,涉及大量容易出错的样板代码。如果程序员愿意,他们仍然可以以这种方式编写手动睡眠;<linux/sched.h>包含所有必需的定义,并且内核源代码中满是这样的示例。然而,还有一种更简单的方法。

In previous versions of the Linux kernel, nontrivial sleeps required the programmer to handle all of the above steps manually. It was a tedious process involving a fair amount of error-prone boilerplate code. Programmers can still code a manual sleep in that manner if they want to; <linux/sched.h> contains all the requisite definitions, and the kernel source abounds with examples. There is an easier way, however.

第一步是创建并初始化 等待队列条目。这通常是用这个宏完成的:

The first step is the creation and initialization of a wait queue entry. That is usually done with this macro:

DEFINE_WAIT(my_wait);
DEFINE_WAIT(my_wait);

其中my_wait是等待队列条目变量的名称。您还可以分两步执行操作:

in which my_wait is the name of the wait queue entry variable. You can also do things in two steps:

wait_queue_t my_wait;
init_wait(&my_wait);
wait_queue_t my_wait;
init_wait(&my_wait);

但通常更容易在实现睡眠的循环顶部放置一行DEFINE_WAIT

But it is usually easier to put a DEFINE_WAIT line at the top of the loop that implements your sleep.

下一步是将等待队列条目添加到队列中,并设置进程状态。这两个任务都由该函数处理:

The next step is to add your wait queue entry to the queue, and set the process state. Both of those tasks are handled by this function:

void prepare_to_wait(wait_queue_head_t *queue,
                     wait_queue_t *wait,
                     int state);
void prepare_to_wait(wait_queue_head_t *queue,
                     wait_queue_t *wait,
                     int state);

这里,queuewait分别是等待队列头和进程入口。state是进程的新状态;它应该是 TASK_INTERRUPTIBLE(对于可中断睡眠,这通常是您想要的)或TASK_UNINTERRUPTIBLE (对于不间断睡眠)

Here, queue and wait are the wait queue head and the process entry, respectively. state is the new state for the process; it should be either TASK_INTERRUPTIBLE (for interruptible sleeps, which is usually what you want) or TASK_UNINTERRUPTIBLE (for uninterruptible sleeps).

调用prepare_to_wait之后,进程可以调用schedule,当然要先检查确定它仍然需要等待。一旦schedule返回,就到了清理时间。该任务也由一个特殊函数处理:

After calling prepare_to_wait, the process can call schedule—after it has checked to be sure it still needs to wait. Once schedule returns, it is cleanup time. That task, too, is handled by a special function:

void finish_wait(wait_queue_head_t *queue, wait_queue_t *wait);
void finish_wait(wait_queue_head_t *queue, wait_queue_t *wait);

此后,您的代码可以测试其状态并查看是否需要再次等待。

Thereafter, your code can test its state and see if it needs to wait again.

我们早就该举个例子了。之前我们研究了scullpiperead方法,它使用wait_event。同一驱动程序中的write方法则使用prepare_to_waitfinish_wait进行等待。通常,您不会以这种方式在单个驱动程序中混用两种方法,但我们这样做是为了能够展示处理睡眠的两种方式。

We are far past due for an example. Previously we looked at the read method for scullpipe, which uses wait_event. The write method in the same driver does its waiting with prepare_to_wait and finish_wait, instead. Normally you would not mix methods within a single driver in this way, but we did so in order to be able to show both ways of handling sleeps.

首先,为了完整起见,让我们看看write方法本身:

First, for completeness, let's look at the write method itself:

/* 有多少可用空间? */
static int spacefree(struct scull_pipe *dev)
{
    if (dev->rp == dev->wp)
        return dev->buffersize - 1;
    return ((dev->rp + dev->buffersize - dev->wp) % dev->buffersize) - 1;
}

static ssize_t scull_p_write(struct file *filp, const char __user *buf, size_t count,
                loff_t *f_pos)
{
    struct scull_pipe *dev = filp->private_data;
    int result;

    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;

    /* 确保有空间可写 */
    result = scull_getwritespace(dev, filp);
    if (result)
        return result; /* scull_getwritespace 已调用 up(&dev->sem) */

    /* 好的,空间就在那里,接受一些数据 */
    count = min(count, (size_t)spacefree(dev));
    if (dev->wp >= dev->rp)
        count = min(count, (size_t)(dev->end - dev->wp)); /* 到缓冲区末尾 */
    else /* 写指针已经回绕,填充到 rp-1 */
        count = min(count, (size_t)(dev->rp - dev->wp - 1));
    PDEBUG("Going to accept %li bytes to %p from %p\n", (long)count, dev->wp, buf);
    if (copy_from_user(dev->wp, buf, count)) {
        up(&dev->sem);
        return -EFAULT;
    }
    dev->wp += count;
    if (dev->wp == dev->end)
        dev->wp = dev->buffer; /* 回绕 */
    up(&dev->sem);

    /* 最后,唤醒所有读者 */
    wake_up_interruptible(&dev->inq); /* 阻塞在 read() 和 select() 中 */

    /* 并向异步读取器发送信号,在第 5 章后面解释 */
    if (dev->async_queue)
        kill_fasync(&dev->async_queue, SIGIO, POLL_IN);
    PDEBUG("\"%s\" did write %li bytes\n", current->comm, (long)count);
    return count;
}
/* How much space is free? */
static int spacefree(struct scull_pipe *dev)
{
    if (dev->rp == dev->wp)
        return dev->buffersize - 1;
    return ((dev->rp + dev->buffersize - dev->wp) % dev->buffersize) - 1;
}

static ssize_t scull_p_write(struct file *filp, const char __user *buf, size_t count,
                loff_t *f_pos)
{
    struct scull_pipe *dev = filp->private_data;
    int result;

    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;

    /* Make sure there's space to write */
    result = scull_getwritespace(dev, filp);
    if (result)
        return result; /* scull_getwritespace called up(&dev->sem) */

    /* ok, space is there, accept something */
    count = min(count, (size_t)spacefree(dev));
    if (dev->wp >= dev->rp)
        count = min(count, (size_t)(dev->end - dev->wp)); /* to end-of-buf */
    else /* the write pointer has wrapped, fill up to rp-1 */
        count = min(count, (size_t)(dev->rp - dev->wp - 1));
    PDEBUG("Going to accept %li bytes to %p from %p\n", (long)count, dev->wp, buf);
    if (copy_from_user(dev->wp, buf, count)) {
        up(&dev->sem);
        return -EFAULT;
    }
    dev->wp += count;
    if (dev->wp == dev->end)
        dev->wp = dev->buffer; /* wrapped */
    up(&dev->sem);

    /* finally, awake any reader */
    wake_up_interruptible(&dev->inq);  /* blocked in read() and select() */

    /* and signal asynchronous readers, explained late in chapter 5 */
    if (dev->async_queue)
        kill_fasync(&dev->async_queue, SIGIO, POLL_IN);
    PDEBUG("\"%s\" did write %li bytes\n", current->comm, (long)count);
    return count;
}

这段代码看起来与read方法类似,只是我们将休眠的代码推送到了一个名为 scull_getwritespace的单独函数中 。它的工作是确保缓冲区中有用于新数据的空间,如果需要,则休眠直到该空间可用。一旦有空间, scull_p_write就可以简单地将用户数据复制到那里,调整指针,并唤醒任何可能一直在等待读取数据的进程。

This code looks similar to the read method, except that we have pushed the code that sleeps into a separate function called scull_getwritespace . Its job is to ensure that there is space in the buffer for new data, sleeping if need be until that space comes available. Once the space is there, scull_p_write can simply copy the user's data there, adjust the pointers, and wake up any processes that may have been waiting to read data.

处理实际睡眠的代码如下:

The code that handles the actual sleep is:

/* 等待写入空间;调用者必须持有设备信号量。出错时,
 * 信号量将在返回前释放。 */
static int scull_getwritespace(struct scull_pipe *dev, struct file *filp)
{
    while (spacefree(dev) == 0) { /* 满 */
        DEFINE_WAIT(wait);

        up(&dev->sem);
        if (filp->f_flags & O_NONBLOCK)
            return -EAGAIN;
        PDEBUG("\"%s\" writing: going to sleep\n", current->comm);
        prepare_to_wait(&dev->outq, &wait, TASK_INTERRUPTIBLE);
        if (spacefree(dev) == 0)
            schedule();
        finish_wait(&dev->outq, &wait);
        if (signal_pending(current))
            return -ERESTARTSYS; /* 信号:告诉 fs 层处理它 */
        if (down_interruptible(&dev->sem))
            return -ERESTARTSYS;
    }
    return 0;
}
/* Wait for space for writing; caller must hold device semaphore.  On
 * error the semaphore will be released before returning. */
static int scull_getwritespace(struct scull_pipe *dev, struct file *filp)
{
    while (spacefree(dev) == 0) { /* full */
        DEFINE_WAIT(wait);

        up(&dev->sem);
        if (filp->f_flags & O_NONBLOCK)
            return -EAGAIN;
        PDEBUG("\"%s\" writing: going to sleep\n", current->comm);
        prepare_to_wait(&dev->outq, &wait, TASK_INTERRUPTIBLE);
        if (spacefree(dev) == 0)
            schedule();
        finish_wait(&dev->outq, &wait);
        if (signal_pending(current))
            return -ERESTARTSYS; /* signal: tell the fs layer to handle it */
        if (down_interruptible(&dev->sem))
            return -ERESTARTSYS;
    }
    return 0;
}

再次注意外层的while循环。如果无需休眠就有可用空间,该函数会简单地返回。否则,它必须释放设备信号量并等待。该代码使用DEFINE_WAIT设置等待队列条目,并使用prepare_to_wait为实际睡眠做好准备。然后是对缓冲区的必要检查;我们必须处理这样一种情况:在我们进入while循环(并释放信号量)之后、但在将自己放入等待队列之前,缓冲区中出现了可用空间。如果没有这种检查,而读取进程恰好在那段时间内完全清空了缓冲区,我们就可能错过唯一一次唤醒并永远休眠。在确信必须睡眠之后,我们就可以调用schedule了。

Note once again the containing while loop. If space is available without sleeping, this function simply returns. Otherwise, it must drop the device semaphore and wait. The code uses DEFINE_WAIT to set up a wait queue entry and prepare_to_wait to get ready for the actual sleep. Then comes the obligatory check on the buffer; we must handle the case in which space becomes available in the buffer after we have entered the while loop (and dropped the semaphore) but before we put ourselves onto the wait queue. Without that check, if the reader processes were able to completely empty the buffer in that time, we could miss the only wakeup we would ever get and sleep forever. Having satisfied ourselves that we must sleep, we can call schedule.

值得再次审视这种情况:如果唤醒发生在if语句中的测试和对schedule的调用之间,会发生什么?在这种情况下,一切都很好。唤醒会将进程状态重置为TASK_RUNNING,并且schedule会返回,尽管不一定立即返回。只要测试发生在进程将自身放入等待队列并更改其状态之后,事情就会正常进行。

It is worth looking again at this case: what happens if the wakeup happens between the test in the if statement and the call to schedule? In that case, all is well. The wakeup resets the process state to TASK_RUNNING and schedule returns—although not necessarily right away. As long as the test happens after the process has put itself on the wait queue and changed its state, things will work.

为了收尾,我们调用finish_wait。对signal_pending的调用告诉我们是否被信号唤醒;如果是这样,我们需要返回给用户,让他们稍后再试。否则,我们重新获取信号量,并像往常一样再次测试可用空间。

To finish up, we call finish_wait. The call to signal_pending tells us whether we were awakened by a signal; if so, we need to return to the user and let them try again later. Otherwise, we reacquire the semaphore, and test again for free space as usual.

独占等待

Exclusive waits

我们已经看到,当进程在等待队列上调用wake_up时,所有在该队列上等待的进程都会变得可运行。在许多情况下,这是正确的行为。然而,在其他情况下,可能提前就知道被唤醒的进程中只有一个会成功获得所需的资源,其余进程将不得不再次休眠。然而,这些进程中的每一个都必须获得处理器,争夺资源(以及任何控制锁),并显式地返回睡眠状态。如果等待队列中的进程数量很大,这种“惊群”行为会严重降低系统的性能。

We have seen that when a process calls wake_up on a wait queue, all processes waiting on that queue are made runnable. In many cases, that is the correct behavior. In others, however, it is possible to know ahead of time that only one of the processes being awakened will succeed in obtaining the desired resource, and the rest will simply have to sleep again. Each one of those processes, however, has to obtain the processor, contend for the resource (and any governing locks), and explicitly go back to sleep. If the number of processes in the wait queue is large, this "thundering herd" behavior can seriously degrade the performance of the system.

为了应对现实世界中的惊群问题,内核开发人员向内核添加了“独占等待”选项。独占等待的行为与正常睡眠非常相似,但有两个重要的区别:

In response to real-world thundering herd problems, the kernel developers added an "exclusive wait" option to the kernel. An exclusive wait acts very much like a normal sleep, with two important differences:

  • 当等待队列条目设置了WQ_FLAG_EXCLUSIVE标志时,它被添加到等待队列的末尾;没有该标志的条目则被添加到开头。

  • When a wait queue entry has the WQ_FLAG_EXCLUSIVE flag set, it is added to the end of the wait queue. Entries without that flag are, instead, added to the beginning.

  • 当在等待队列上调用wake_up时,它会在唤醒第一个设置了WQ_FLAG_EXCLUSIVE标志的进程后停止。

  • When wake_up is called on a wait queue, it stops after waking the first process that has the WQ_FLAG_EXCLUSIVE flag set.

最终结果是,执行独占等待的进程一次被唤醒一个,顺序有序,并且不会产生惊群。然而,内核每次仍然会唤醒所有非独占的等待者。

The end result is that processes performing exclusive waits are awakened one at a time, in an orderly manner, and do not create thundering herds. The kernel still wakes up all nonexclusive waiters every time, however.

如果满足两个条件,则值得考虑在驱动程序中使用独占等待:您预计资源会出现严重争用,并且唤醒单个进程足以在资源可用时完全消耗该资源。例如,独占等待对于 Apache Web 服务器来说效果很好;当一个新连接到来时,系统上的一个(通常是多个)Apache 进程应该被唤醒来处理它。然而,我们没有在scullpipe驱动程序中使用独占等待;很少看到读取器争夺数据(或写入器争夺缓冲区空间),并且我们无法知道一个读取器一旦被唤醒,将消耗所有可用数据。

Employing exclusive waits within a driver is worth considering if two conditions are met: you expect significant contention for a resource, and waking a single process is sufficient to completely consume the resource when it becomes available. Exclusive waits work well for the Apache web server, for example; when a new connection comes in, exactly one of the (often many) Apache processes on the system should wake up to deal with it. We did not use exclusive waits in the scullpipe driver, however; it is rare to see readers contending for data (or writers for buffer space), and we cannot know that one reader, once awakened, will consume all of the available data.

将进程置于独占等待状态只需调用prepare_to_wait_exclusive即可:

Putting a process into an exclusive wait is a simple matter of calling prepare_to_wait_exclusive:

void prepare_to_wait_exclusive(wait_queue_head_t *queue,
                               wait_queue_t *wait,
                               int state);
void prepare_to_wait_exclusive(wait_queue_head_t *queue,
                               wait_queue_t *wait,
                               int state);

当使用此调用代替prepare_to_wait时,会在等待队列条目中设置“独占”标志,并将进程添加到等待队列的末尾。请注意,无法使用 wait_event及其变体执行独占等待。

This call, when used in place of prepare_to_wait, sets the "exclusive" flag in the wait queue entry and adds the process to the end of the wait queue. Note that there is no way to perform exclusive waits with wait_event and its variants.
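一个独占等待的循环大致如下(示意代码;queue 和 condition 是假设的名字):

A loop performing an exclusive wait looks roughly like this (a sketch; queue and condition are hypothetical names):

```c
DEFINE_WAIT(wait);

for (;;) {
    prepare_to_wait_exclusive(&queue, &wait, TASK_INTERRUPTIBLE);
    if (condition)                /* 设置状态之后再检查条件 / check after setting the state */
        break;
    schedule();
    if (signal_pending(current)) {
        finish_wait(&queue, &wait);
        return -ERESTARTSYS;      /* 被信号唤醒 / awakened by a signal */
    }
}
finish_wait(&queue, &wait);
```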

唤醒的细节

The details of waking up

我们所呈现的唤醒过程的视图比内核内部实际发生的情况更简单。唤醒进程时产生的实际行为由等待队列条目中的一个函数控制。默认唤醒函数[3]将进程设置为可运行状态,并且如果该进程具有更高的优先级,则可能会执行到该进程的上下文切换。设备驱动程序永远不需要提供不同的唤醒函数;如果您的情况被证明是例外,请参阅<linux/wait.h>以获取有关如何执行此操作的信息。

The view we have presented of the wakeup process is simpler than what really happens inside the kernel. The actual behavior that results when a process is awakened is controlled by a function in the wait queue entry. The default wakeup function[3] sets the process into a runnable state and, possibly, performs a context switch to that process if it has a higher priority. Device drivers should never need to supply a different wake function; should yours prove to be the exception, see <linux/wait.h> for information on how to do it.

我们还没有看到wake_up的所有变体。大多数驱动程序编写者从不需要其他变体,但为了完整起见,这里是全套:

We have not yet seen all the variations of wake_up. Most driver writers never need the others, but, for completeness, here is the full set:

wake_up(wait_queue_head_t *queue);

wake_up_interruptible(wait_queue_head_t *queue);
wake_up(wait_queue_head_t *queue);

wake_up_interruptible(wait_queue_head_t *queue);

wake_up唤醒队列中未处于独占等待状态的每个进程,以及一个独占等待者(如果存在)。 wake_up_interruptible执行相同的操作,但它会跳过不可中断睡眠中的进程。这些函数可以在返回之前导致一个或多个唤醒的进程被调度(尽管如果从原子上下文调用它们则不会发生这种情况)。

wake_up awakens every process on the queue that is not in an exclusive wait, and exactly one exclusive waiter, if it exists. wake_up_interruptible does the same, with the exception that it skips over processes in an uninterruptible sleep. These functions can, before returning, cause one or more of the processes awakened to be scheduled (although this does not happen if they are called from an atomic context).

wake_up_nr(wait_queue_head_t *queue, int nr);

wake_up_interruptible_nr(wait_queue_head_t *queue, int nr);
wake_up_nr(wait_queue_head_t *queue, int nr);

wake_up_interruptible_nr(wait_queue_head_t *queue, int nr);

这些函数的执行方式与wake_up类似,只不过它们最多可以唤醒nr个独占等待者,而不仅仅是一个。请注意,传递 0 被解释为要求唤醒所有独占等待者,而不是一个都不唤醒。

These functions perform similarly to wake_up, except they can awaken up to nr exclusive waiters, instead of just one. Note that passing 0 is interpreted as asking for all of the exclusive waiters to be awakened, rather than none of them.

wake_up_all(wait_queue_head_t *queue);

wake_up_interruptible_all(wait_queue_head_t *queue);
wake_up_all(wait_queue_head_t *queue);

wake_up_interruptible_all(wait_queue_head_t *queue);

这种形式的wake_up会唤醒所有进程,无论它们是否正在执行独占等待(尽管可中断形式仍然会跳过执行不可中断等待的进程)。

This form of wake_up awakens all processes whether they are performing an exclusive wait or not (though the interruptible form still skips processes doing uninterruptible waits).

wake_up_interruptible_sync(wait_queue_head_t *queue);
wake_up_interruptible_sync(wait_queue_head_t *queue);

通常,被唤醒的进程可能会抢占当前进程,并在wake_up返回之前被调度到处理器中。换句话说,对wake_up的调用可能不是原子的。如果调用wake_up的进程在原子上下文中运行(例如,它持有自旋锁,或者是中断处理程序),则不会发生这种重新调度。通常情况下,这种保护就足够了。但是,如果您需要明确要求此时不被调度出处理器,则可以使用wake_up_interruptible的“sync”变体。当调用者无论如何都要重新安排时,最常使用此函数,并且首先简单地完成剩下的少量工作会更有效。

Normally, a process that is awakened may preempt the current process and be scheduled into the processor before wake_up returns. In other words, a call to wake_up may not be atomic. If the process calling wake_up is running in an atomic context (it holds a spinlock, for example, or is an interrupt handler), this rescheduling does not happen. Normally, that protection is adequate. If, however, you need to explicitly ask to not be scheduled out of the processor at this time, you can use the "sync" variant of wake_up_interruptible. This function is most often used when the caller is about to reschedule anyway, and it is more efficient to simply finish what little work remains first.

如果第一次阅读时上述所有内容并不完全清楚,请不要担心。很少有驱动程序需要调用除 wake_up_interruptible之外的任何东西。

If all of the above is not entirely clear on a first reading, don't worry. Very few drivers ever need to call anything except wake_up_interruptible.

古代历史:sleep_on

Ancient history: sleep_on

如果您花时间挖掘内核源代码,您可能会遇到我们迄今为止忽略讨论的两个函数:

If you spend any time digging through the kernel source, you will likely encounter two functions that we have neglected to discuss so far:

void sleep_on(wait_queue_head_t *queue);
void interruptible_sleep_on(wait_queue_head_t *queue);
void sleep_on(wait_queue_head_t *queue);
void interruptible_sleep_on(wait_queue_head_t *queue);

正如您所期望的,这些函数无条件地将当前进程置于给定queue上的睡眠状态。但是,这些函数已被强烈弃用,您永远不应该使用它们。如果您仔细想想,问题就很明显了:sleep_on没有提供任何防止竞争条件的方法。在您的代码决定它必须睡眠与sleep_on实际使其进入睡眠之间,总是存在一个窗口。在该窗口期间到达的唤醒会被错过。因此,调用sleep_on的代码永远不会完全安全。

As you might expect, these functions unconditionally put the current process to sleep on the given queue. These functions are strongly deprecated, however, and you should never use them. The problem is obvious if you think about it: sleep_on offers no way to protect against race conditions. There is always a window between when your code decides it must sleep and when sleep_on actually effects that sleep. A wakeup that arrives during that window is missed. For this reason, code that calls sleep_on is never entirely safe.
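用一小段示意代码就能看出这个窗口(condition 和 queue 是假设的名字):

A short sketch makes the window visible (condition and queue are hypothetical names):

```c
/* 不安全的旧式代码 / unsafe, old-style code */
while (!condition) {
    /* <-- 窗口:在这里到达的唤醒会丢失
     * <-- the window: a wakeup arriving here is lost */
    interruptible_sleep_on(&queue);
}
```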

当前的计划要求在不久的将来从内核中删除sleep_on及其变体(有几种我们尚未显示的超时形式)。

Current plans call for sleep_on and its variants (there are a couple of time-out forms we haven't shown) to be removed from the kernel in the not-too-distant future.

测试 scullpipe 驱动程序

Testing the Scullpipe Driver

我们已经看到scullpipe驱动程序如何实现阻塞 I/O。如果您想尝试一下,可以在本书其余示例中找到该驱动程序的源代码。打开两个窗口即可看到阻塞 I/O 的实际运行。第一个窗口可以运行诸如 cat /dev/scullpipe 之类的命令。然后,如果您在另一个窗口中将文件复制到 /dev/scullpipe,您应该会看到该文件的内容出现在第一个窗口中。

We have seen how the scullpipe driver implements blocking I/O. If you wish to try it out, the source to this driver can be found with the rest of the book examples. Blocking I/O in action can be seen by opening two windows. The first can run a command such as cat /dev/scullpipe. If you then, in another window, copy a file to /dev/scullpipe, you should see that file's contents appear in the first window.

测试非阻塞活动比较棘手,因为 shell 可用的传统程序不执行非阻塞操作。misc-progs 源目录包含以下名为 nbtest 的简单程序,用于测试非阻塞操作。它所做的只是使用非阻塞 I/O 并在重试之间加入延迟,将输入复制到输出。延迟时间通过命令行传递,默认为一秒。

Testing nonblocking activity is trickier, because the conventional programs available to a shell don't perform nonblocking operations. The misc-progs source directory contains the following simple program, called nbtest , for testing nonblocking operations. All it does is copy its input to its output, using nonblocking I/O and delaying between retries. The delay time is passed on the command line and is one second by default.

int main(int argc, char **argv)
{
    int delay = 1, n, m = 0;

    if (argc > 1)
        delay = atoi(argv[1]);
    fcntl(0, F_SETFL, fcntl(0,F_GETFL) | O_NONBLOCK); /* 标准输入 */
    fcntl(1, F_SETFL, fcntl(1,F_GETFL) | O_NONBLOCK); /* 标准输出 */

    while (1) {
        n = read(0, buffer, 4096);
        if (n >= 0)
            m = write(1, buffer, n);
        if ((n < 0 || m < 0) && (errno != EAGAIN))
            break;
        sleep(delay);
    }
    perror(n < 0 ? "stdin" : "stdout");
    exit(1);
}
int main(int argc, char **argv)
{
    int delay = 1, n, m = 0;

    if (argc > 1)
        delay=atoi(argv[1]);
    fcntl(0, F_SETFL, fcntl(0,F_GETFL) | O_NONBLOCK); /* stdin */
    fcntl(1, F_SETFL, fcntl(1,F_GETFL) | O_NONBLOCK); /* stdout */

    while (1) {
        n = read(0, buffer, 4096);
        if (n >= 0)
            m = write(1, buffer, n);
        if ((n < 0 || m < 0) && (errno != EAGAIN))
            break;
        sleep(delay);
    }
    perror(n < 0 ? "stdin" : "stdout");
    exit(1);
}

如果您在诸如 strace 之类的进程跟踪实用程序下运行该程序,您可以看到每次操作的成功或失败,这取决于尝试操作时是否有数据可用。

If you run this program under a process tracing utility such as strace, you can see the success or failure of each operation, depending on whether data is available when the operation is tried.

poll 和 select

poll and select

使用非阻塞 I/O 的应用程序也经常使用 pollselectepoll 系统调用。pollselectepoll 具有本质上相同的功能:每个都允许进程确定它是否可以在不阻塞的情况下读取或写入一个或多个打开的文件。这些调用还可以阻塞进程,直到给定的一组文件描述符中的任何一个可供读取或写入。因此,它们通常用于必须使用多个输入或输出流而不会卡在其中任何一个流上的应用程序。相同的功能之所以由多个函数提供,是因为两个函数几乎同时由两个不同的小组在 Unix 中实现:select 是在 BSD Unix 中引入的,而 poll 是 System V 的解决方案。epoll 调用[4]在 2.5.45 中添加,作为使轮询功能扩展到数千个文件描述符的一种方式。

Applications that use nonblocking I/O often use the poll, select, and epoll system calls as well. poll, select, and epoll have essentially the same functionality: each allow a process to determine whether it can read from or write to one or more open files without blocking. These calls can also block a process until any of a given set of file descriptors becomes available for reading or writing. Therefore, they are often used in applications that must use multiple input or output streams without getting stuck on any one of them. The same functionality is offered by multiple functions, because two were implemented in Unix almost at the same time by two different groups: select was introduced in BSD Unix, whereas poll was the System V solution. The epoll call[4] was added in 2.5.45 as a way of making the polling function scale to thousands of file descriptors.

对任何这些调用的支持都需要设备驱动程序的支持。这种支持(对于所有三个调用)是通过驱动程序的poll方法提供的。该方法具有以下原型:

Support for any of these calls requires support from the device driver. This support (for all three calls) is provided through the driver's poll method. This method has the following prototype:

unsigned int (*poll) (struct file *filp, poll_table *wait);
unsigned int (*poll) (struct file *filp, poll_table *wait);

每当用户空间程序执行 涉及与驱动程序关联的文件描述符的pollselectepoll系统调用时,都会调用驱动程序方法。设备方法负责这两个步骤:

The driver method is called whenever the user-space program performs a poll, select, or epoll system call involving a file descriptor associated with the driver. The device method is in charge of these two steps:

  1. 在一个或多个可以指示 poll 状态变化的等待队列上调用 poll_wait。如果当前没有文件描述符可用于 I/O,内核会使进程在传递给系统调用的所有文件描述符的等待队列上等待。

  1. Call poll_wait on one or more wait queues that could indicate a change in the poll status. If no file descriptors are currently available for I/O, the kernel causes the process to wait on the wait queues for all file descriptors passed to the system call.

  2. 返回一个位掩码,描述可以立即执行而不会阻塞的操作(如果有)。

  2. Return a bit mask describing the operations (if any) that could be immediately performed without blocking.

这两种操作通常都很简单,并且从一个驱动程序到另一个驱动程序看起来都非常相似。然而,它们依赖于只有驱动程序才能提供的信息,因此必须由每个驱动程序单独实现。

Both of these operations are usually straightforward and tend to look very similar from one driver to the next. They rely, however, on information that only the driver can provide and, therefore, must be implemented individually by each driver.

poll_table结构是poll方法的第二个参数,在内核中用于实现pollselectepoll调用;它在<linux/poll.h>中声明,驱动程序源代码必须包含该头文件。驱动程序编写者不需要了解其内部的任何信息,必须将其用作不透明对象;它被传递给驱动程序方法,以便驱动程序可以将每个可能唤醒进程并更改poll操作状态的等待队列加载到其中。驱动程序通过调用poll_wait函数将等待队列添加到poll_table结构中:

The poll_table structure, the second argument to the poll method, is used within the kernel to implement the poll, select, and epoll calls; it is declared in <linux/poll.h>, which must be included by the driver source. Driver writers do not need to know anything about its internals and must use it as an opaque object; it is passed to the driver method so that the driver can load it with every wait queue that could wake up the process and change the status of the poll operation. The driver adds a wait queue to the poll_table structure by calling the function poll_wait:

void poll_wait(struct file *, wait_queue_head_t *, poll_table *);
void poll_wait(struct file *, wait_queue_head_t *, poll_table *);

poll方法执行的第二个任务是返回描述哪些操作可以立即完成的位掩码;这也很简单。例如,如果设备有可用数据,则读取 将在不休眠的情况下完成;poll方法应该指示这种情况。几个标志(通过<linux/poll.h>定义)用于指示可能的操作:

The second task performed by the poll method is returning the bit mask describing which operations could be completed immediately; this is also straightforward. For example, if the device has data available, a read would complete without sleeping; the poll method should indicate this state of affairs. Several flags (defined via <linux/poll.h>) are used to indicate the possible operations:

POLLIN
POLLIN

如果可以无阻塞地读取设备,则必须设置该位。

This bit must be set if the device can be read without blocking.

POLLRDNORM
POLLRDNORM

如果“正常”数据可供读取,则必须设置该位。可读设备返回(POLLIN | POLLRDNORM)

This bit must be set if "normal" data is available for reading. A readable device returns (POLLIN | POLLRDNORM).

POLLRDBAND
POLLRDBAND

该位指示带外数据可用于从设备读取。它目前仅在 Linux 内核中的一处使用(DECnet 代码),并且通常不适用于设备驱动程序。

This bit indicates that out-of-band data is available for reading from the device. It is currently used only in one place in the Linux kernel (the DECnet code) and is not generally applicable to device drivers.

POLLPRI
POLLPRI

高优先级数据(带外)可以无阻塞地读取。该位导致 select报告文件上发生了异常情况,因为select将带外数据报告为异常情况。

High-priority data (out-of-band) can be read without blocking. This bit causes select to report that an exception condition occurred on the file, because select reports out-of-band data as an exception condition.

POLLHUP
POLLHUP

当读取此设备的进程看到文件结尾时,驱动程序必须设置POLLHUP(挂起)。调用select 的进程被告知该设备是可读的,如select 功能所指示的。

When a process reading this device sees end-of-file, the driver must set POLLHUP (hang-up). A process calling select is told that the device is readable, as dictated by the select functionality.

POLLERR
POLLERR

设备上出现错误情况。当调用poll时 ,设备被报告为可读可写,因为读和写都返回错误代码而不会阻塞。

An error condition has occurred on the device. When poll is invoked, the device is reported as both readable and writable, since both read and write return an error code without blocking.

POLLOUT
POLLOUT

如果可以无阻塞地写入设备,则在返回值中设置该位。

This bit is set in the return value if the device can be written to without blocking.

POLLWRNORM
POLLWRNORM

该位与POLLOUT的含义相同,有时实际上是同一个数字。可写设备返回(POLLOUT | POLLWRNORM)

This bit has the same meaning as POLLOUT, and sometimes it actually is the same number. A writable device returns (POLLOUT | POLLWRNORM).

POLLWRBAND
POLLWRBAND

POLLRDBAND一样,该位表示可以将非零优先级的数据写入设备。只有poll的数据报实现使用该位,因为数据报可以传输带外数据。

Like POLLRDBAND, this bit means that data with nonzero priority can be written to the device. Only the datagram implementation of poll uses this bit, since a datagram can transmit out-of-band data.

值得重复的是,POLLRDBANDPOLLWRBAND仅对与套接字关联的文件描述符有意义:设备驱动程序通常不会使用这些标志。

It's worth repeating that POLLRDBAND and POLLWRBAND are meaningful only with file descriptors associated with sockets: device drivers won't normally use these flags.

poll的描述占用了大量篇幅,但它在实践中使用起来相对简单。考虑scullpipepoll方法的实现:

The description of poll takes up a lot of space for something that is relatively simple to use in practice. Consider the scullpipe implementation of the poll method:

static unsigned int scull_p_poll(struct file *filp, poll_table *wait)
{
    struct scull_pipe *dev = filp->private_data;
    unsigned int mask = 0;

    /*
     * 缓冲区是循环的;如果 "wp" 正好在 "rp" 后面,
     * 则认为它已满;如果两者相等,则认为它为空。
     */
    down(&dev->sem);
    poll_wait(filp, &dev->inq,  wait);
    poll_wait(filp, &dev->outq, wait);
    if (dev->rp != dev->wp)
        mask |= POLLIN | POLLRDNORM;    /* 可读 */
    if (spacefree(dev))
        mask |= POLLOUT | POLLWRNORM;   /* 可写 */
    up(&dev->sem);
    return mask;
}
static unsigned int scull_p_poll(struct file *filp, poll_table *wait)
{
    struct scull_pipe *dev = filp->private_data;
    unsigned int mask = 0;

    /*
     * The buffer is circular; it is considered full
     * if "wp" is right behind "rp" and empty if the
     * two are equal.
     */
    down(&dev->sem);
    poll_wait(filp, &dev->inq,  wait);
    poll_wait(filp, &dev->outq, wait);
    if (dev->rp != dev->wp)
        mask |= POLLIN | POLLRDNORM;    /* readable */
    if (spacefree(dev))
        mask |= POLLOUT | POLLWRNORM;   /* writable */
    up(&dev->sem);
    return mask;
}

此代码只是将两个 scullpipe 等待队列添加到 poll_table 中,然后根据数据是否可以读取或写入来设置适当的掩码位。

This code simply adds the two scullpipe wait queues to the poll_table, then sets the appropriate mask bits depending on whether data can be read or written.

如上所示的 poll 代码缺少文件结束支持,因为 scullpipe 不支持文件结束条件。对于大多数真实设备,如果没有(也不会再有)更多数据可用,poll 方法应该返回 POLLHUP。如果调用者使用 select 系统调用,该文件会被报告为可读。无论使用 poll 还是 select,应用程序都知道它可以调用 read 而不会永远等待,而 read 方法会返回 0 以发出文件结束信号。

The poll code as shown is missing end-of-file support, because scullpipe does not support an end-of-file condition. For most real devices, the poll method should return POLLHUP if no more data is (or will become) available. If the caller used the select system call, the file is reported as readable. Regardless of whether poll or select is used, the application knows that it can call read without waiting forever, and the read method returns 0 to signal end-of-file.

例如,对于真正的 FIFO,当所有写入者都关闭文件时,读取器会看到文件结束符,而在scullpipe中,读取器永远不会看到文件结束符。行为有所不同,因为 FIFO 旨在成为两个进程之间的通信通道,而scullpipe是一个垃圾桶,只要有至少一个读取器,每个人都可以在其中放入数据。此外,重新实现内核中已有的功能是没有意义的,因此我们选择在示例中实现不同的行为。

With real FIFOs, for example, the reader sees an end-of-file when all the writers close the file, whereas in scullpipe the reader never sees end-of-file. The behavior is different because a FIFO is intended to be a communication channel between two processes, while scullpipe is a trash can where everyone can put data as long as there's at least one reader. Moreover, it makes no sense to reimplement what is already available in the kernel, so we chose to implement a different behavior in our example.

以与 FIFO 相同的方式实现文件结束,意味着在 read 和 poll 中都检查 dev->nwriters,并在没有进程打开设备进行写入时报告文件结束(如上所述)。但不幸的是,在这种实现中,如果读取器在写入器之前打开 scullpipe 设备,它将看到文件结尾,而没有机会等待数据。解决这个问题的最好方法是像真正的 FIFO 一样在 open 中实现阻塞;这项任务留给读者作为练习。

Implementing end-of-file in the same way as FIFOs do would mean checking dev->nwriters, both in read and in poll, and reporting end-of-file (as just described) if no process has the device opened for writing. Unfortunately, though, with this implementation, if a reader opened the scullpipe device before the writer, it would see end-of-file without having a chance to wait for data. The best way to fix this problem would be to implement blocking within open like real FIFOs do; this task is left as an exercise for the reader.

读写交互

Interaction with read and write

poll 和 select 调用的目的是提前确定 I/O 操作是否会阻塞。在这方面,它们是对 read 和 write 的补充。更重要的是,poll 和 select 很有用,因为它们让应用程序可以同时等待多个数据流,尽管我们在 scull 示例中没有利用这一特性。

The purpose of the poll and select calls is to determine in advance if an I/O operation will block. In that respect, they complement read and write. More important, poll and select are useful, because they let the application wait simultaneously for several data streams, although we are not exploiting this feature in the scull examples.

正确实现这三个调用对于应用程序正常工作至关重要:尽管已经或多或少地陈述了以下规则,但我们在这里总结它们。

A correct implementation of the three calls is essential to make applications work correctly: although the following rules have more or less already been stated, we summarize them here.

从设备读取数据

Reading data from the device

  • 如果输入缓冲区中有数据, 即使可用数据少于应用程序请求的数据,读取调用也应立即返回,没有明显的延迟,并且驱动程序确信剩余数据很快就会到达。如果出于任何原因方便的话,您总是可以返回比要求的数据少的数据(我们在 scull中做到了),只要您返回至少一个字节。在这种情况下, poll应该返回POLLIN|POLLRDNORM

  • If there is data in the input buffer, the read call should return immediately, with no noticeable delay, even if less data is available than the application requested, and the driver is sure the remaining data will arrive soon. You can always return less data than you're asked for if this is convenient for any reason (we did it in scull), provided you return at least one byte. In this case, poll should return POLLIN|POLLRDNORM.

  • 如果输入缓冲区中没有数据,则默认情况下 read 必须阻塞,直到至少有一个字节到达。另一方面,如果设置了 O_NONBLOCK,read 会立即返回,返回值为 -EAGAIN(尽管某些旧版本的 System V 在这种情况下返回 0)。在这些情况下,poll 必须报告设备不可读,直到至少有一个字节到达。一旦缓冲区中有了一些数据,我们就回到了前一种情况。

  • If there is no data in the input buffer, by default read must block until at least one byte is there. If O_NONBLOCK is set, on the other hand, read returns immediately with a return value of -EAGAIN (although some old versions of System V return 0 in this case). In these cases, poll must report that the device is unreadable until at least one byte arrives. As soon as there is some data in the buffer, we fall back to the previous case.

  • 如果我们位于文件末尾,read 应立即返回,返回值为 0,与 O_NONBLOCK 无关。在这种情况下,poll 应该报告 POLLHUP。

  • If we are at end-of-file, read should return immediately with a return value of 0, independent of O_NONBLOCK. poll should report POLLHUP in this case.

写入设备

Writing to the device

  • 如果输出缓冲区中有空间,write 应立即返回。它可以接受比调用请求更少的数据,但必须至少接受一个字节。在这种情况下,poll 通过返回 POLLOUT|POLLWRNORM 来报告设备可写。

  • If there is space in the output buffer, write should return without delay. It can accept less data than the call requested, but it must accept at least one byte. In this case, poll reports that the device is writable by returning POLLOUT|POLLWRNORM.

  • 如果输出缓冲区已满,默认情况下 write 会阻塞,直到释放出一些空间。如果设置了 O_NONBLOCK,write 会立即返回,返回值为 -EAGAIN(旧的 System V Unix 在这种情况下返回 0)。在这些情况下,poll 应报告该文件不可写。另一方面,如果设备无法再接受更多数据,write 会返回 -ENOSPC(“设备上没有剩余空间”),与 O_NONBLOCK 的设置无关。

  • If the output buffer is full, by default write blocks until some space is freed. If O_NONBLOCK is set, write returns immediately with a return value of -EAGAIN (older System V Unices returned 0). In these cases, poll should report that the file is not writable. If, on the other hand, the device is not able to accept any more data, write returns -ENOSPC ("No space left on device"), independently of the setting of O_NONBLOCK.

  • 切勿让 write 调用在返回之前等待数据传输,即使 O_NONBLOCK 被清除也是如此。这是因为许多应用程序使用 select 来确定 write 是否会阻塞。如果设备被报告为可写,则调用不得阻塞。如果使用该设备的程序想要确保它在输出缓冲区中排队的数据确实被传输,驱动程序必须提供 fsync 方法。例如,可移动设备就应该有一个 fsync 入口点。

  • Never make a write call wait for data transmission before returning, even if O_NONBLOCK is clear. This is because many applications use select to find out whether a write will block. If the device is reported as writable, the call must not block. If the program using the device wants to ensure that the data it enqueues in the output buffer is actually transmitted, the driver must provide an fsync method. For instance, a removable device should have an fsync entry point.

尽管这是一套很好的通用规则,但人们还应该认识到每种设备都是独一无二的,有时必须稍微调整规则。例如,面向记录的设备(例如磁带驱动器)无法执行部分写入。

Although this is a good set of general rules, one should also recognize that each device is unique and that sometimes the rules must be bent slightly. For example, record-oriented devices (such as tape drives) cannot execute partial writes.

刷新挂起输出

Flushing pending output

我们已经看到write 方法本身并不能满足所有数据输出需求。由同名系统调用调用的fsync函数 填补了这一空白。这个方法的原型是

We've seen how the write method by itself doesn't account for all data output needs. The fsync function, invoked by the system call of the same name, fills the gap. This method's prototype is

int (*fsync) (struct file *file, struct dentry *dentry, int datasync);

如果某些应用程序需要确保数据已发送到设备,则无论是否设置 O_NONBLOCK,都必须实现 fsync 方法。对 fsync 的调用应该仅在设备被完全刷新(即输出缓冲区为空)时才返回,即使这需要一些时间。datasync 参数用于区分 fsync 和 fdatasync 系统调用;因此,它仅对文件系统代码有意义,可以被驱动程序忽略。

If some application ever needs to be assured that data has been sent to the device, the fsync method must be implemented regardless of whether O_NONBLOCK is set. A call to fsync should return only when the device has been completely flushed (i.e., the output buffer is empty), even if that takes some time. The datasync argument is used to distinguish between the fsync and fdatasync system calls; as such, it is only of interest to filesystem code and can be ignored by drivers.

fsync 方法没有什么不寻常的特征。该调用对时间要求不高,因此每个设备驱动程序都可以按作者的喜好来实现它。大多数时候,char 驱动程序只是在其 fops 中放一个 NULL 指针。另一方面,块设备始终使用通用的 block_fsync 来实现该方法,它会依次刷新设备的所有块,并等待 I/O 完成。

The fsync method has no unusual features. The call isn't time critical, so every device driver can implement it to the author's taste. Most of the time, char drivers just have a NULL pointer in their fops. Block devices, on the other hand, always implement the method with the general-purpose block_fsync, which, in turn, flushes all the blocks of the device, waiting for I/O to complete.

底层数据结构

The Underlying Data Structure

对于那些对其工作原理感兴趣的人来说,poll 和 select 系统调用的实际实现相当简单;epoll 稍微复杂一些,但构建在相同的机制之上。每当用户应用程序调用 poll、select 或 epoll_ctl 时,[ 5 ] 内核都会调用系统调用所引用的所有文件的 poll 方法,并把同一个 poll_table 传递给每个文件。poll_table 结构只是对一个函数的包装,该函数负责构建实际的数据结构。对于 poll 和 select,该结构是一个由内存页组成的链表,其中包含 poll_table_entry 结构。每个 poll_table_entry 保存传递给 poll_wait 的 struct file 和 wait_queue_head_t 指针,以及关联的等待队列条目。对 poll_wait 的调用有时还会将进程添加到给定的等待队列中。整个结构必须由内核维护,以便在 poll 或 select 返回之前,可以将该进程从所有这些队列中删除。

The actual implementation of the poll and select system calls is reasonably simple, for those who are interested in how it works; epoll is a bit more complex but is built on the same mechanism. Whenever a user application calls poll, select, or epoll_ctl,[5] the kernel invokes the poll method of all files referenced by the system call, passing the same poll_table to each of them. The poll_table structure is just a wrapper around a function that builds the actual data structure. That structure, for poll and select, is a linked list of memory pages containing poll_table_entry structures. Each poll_table_entry holds the struct file and wait_queue_head_t pointers passed to poll_wait, along with an associated wait queue entry. The call to poll_wait sometimes also adds the process to the given wait queue. The whole structure must be maintained by the kernel so that the process can be removed from all of those queues before poll or select returns.

如果没有一个被轮询的驱动程序表明 I/O 可以在不阻塞的情况下发生,则轮询 调用只会休眠,直到它所在的等待队列之一(可能是多个)将其唤醒。

If none of the drivers being polled indicates that I/O can occur without blocking, the poll call simply sleeps until one of the (perhaps many) wait queues it is on wakes it up.

poll 实现中有趣的一点是,驱动程序的 poll 方法可能会以 NULL 指针作为 poll_table 参数被调用。出现这种情况有几个原因。如果调用 poll 的应用程序提供的超时值为 0(表明不应进行任何等待),则没有理由累积等待队列,系统也根本不会这样做。在任何一个被轮询的驱动程序表明 I/O 可能发生之后,poll_table 指针也会立即被设置为 NULL。由于内核此时知道不会发生等待,因此它不会建立等待队列列表。

What's interesting in the implementation of poll is that the driver's poll method may be called with a NULL pointer as a poll_table argument. This situation can come about for a couple of reasons. If the application calling poll has provided a timeout value of 0 (indicating that no wait should be done), there is no reason to accumulate wait queues, and the system simply does not do it. The poll_table pointer is also set to NULL immediately after any driver being polled indicates that I/O is possible. Since the kernel knows at that point that no wait will occur, it does not build up a list of wait queues.

轮询调用完成时,该poll_table结构将被释放,并且先前添加到轮询表中的所有等待队列条目(如果有)将从表及其等待队列中删除。

When the poll call completes, the poll_table structure is deallocated, and all wait queue entries previously added to the poll table (if any) are removed from the table and their wait queues.

我们试图在图 6-1 中展示轮询所涉及的数据结构;该图是真实数据结构的简化表示,因为它忽略了轮询表的多页性质,也忽略了每个 poll_table_entry 中的文件指针。对实际实现感兴趣的读者可以查看 <linux/poll.h> 和 fs/select.c。

We tried to show the data structures involved in polling in Figure 6-1; the figure is a simplified representation of the real data structures, because it ignores the multipage nature of a poll table and disregards the file pointer that is part of each poll_table_entry. The reader interested in the actual implementation is urged to look in <linux/poll.h> and fs/select.c.

图 6-1。poll背后的数据结构

Figure 6-1. The data structures behind poll

至此,我们就可以理解新的 epoll 系统调用背后的动机了。在典型情况下,对 poll 或 select 的调用仅涉及少量文件描述符,因此设置数据结构的成本很小。然而,有些应用程序需要处理数千个文件描述符。此时,在每次 I/O 操作之间建立和拆除这个数据结构就会变得极其昂贵。epoll 系统调用家族允许这类应用程序只构造一次内部的内核数据结构,然后多次使用它。

At this point, it is possible to understand the motivation behind the new epoll system call. In a typical case, a call to poll or select involves only a handful of file descriptors, so the cost of setting up the data structure is small. There are applications out there, however, that work with thousands of file descriptors. At that point, setting up and tearing down this data structure between every I/O operation becomes prohibitively expensive. The epoll system call family allows this sort of application to set up the internal kernel data structure exactly once and to use it many times.

异步通知

Asynchronous Notification

虽然组合 阻塞和非阻塞操作以及select方法在大多数情况下足以查询设备,但我们迄今为止看到的技术无法有效地管理某些情况。

Although the combination of blocking and nonblocking operations and the select method are sufficient for querying the device most of the time, some situations aren't efficiently managed by the techniques we've seen so far.

让我们想象一个以低优先级执行长计算循环但需要尽快处理传入数据的进程。如果该过程响应某种数据采集外围设备提供的新观察结果,则它希望立即知道新数据何时可用。该应用程序可以编写为定期调用 poll来检查数据,但是,对于许多情况,有更好的方法。通过启用异步通知,该应用程序可以在数据可用时接收信号,而无需关注轮询。

Let's imagine a process that executes a long computational loop at low priority but needs to process incoming data as soon as possible. If this process is responding to new observations available from some sort of data acquisition peripheral, it would like to know immediately when new data is available. This application could be written to call poll regularly to check for data, but, for many situations, there is a better way. By enabling asynchronous notification, this application can receive a signal whenever data becomes available and need not concern itself with polling.

用户程序必须执行两个步骤才能启用来自输入文件的异步通知。首先,它们指定一个进程作为文件的“所有者”。当进程使用 fcntl 系统调用执行 F_SETOWN 命令时,所有者进程的进程 ID 会被保存在 filp->f_owner 中供以后使用。这一步是必要的,这样内核才知道要通知谁。然后,为了真正启用异步通知,用户程序必须通过 F_SETFL fcntl 命令在设备中设置 FASYNC 标志。

User programs have to execute two steps to enable asynchronous notification from an input file. First, they specify a process as the "owner" of the file. When a process invokes the F_SETOWN command using the fcntl system call, the process ID of the owner process is saved in filp->f_owner for later use. This step is necessary for the kernel to know just whom to notify. In order to actually enable asynchronous notification, the user programs must set the FASYNC flag in the device by means of the F_SETFL fcntl command.

这两个调用执行后,每当新数据到达时,输入文件就可以请求发送 SIGIO 信号。该信号被发送到保存在 filp->f_owner 中的进程(如果该值为负数,则发送到进程组)。

After these two calls have been executed, the input file can request delivery of a SIGIO signal whenever new data arrives. The signal is sent to the process (or process group, if the value is negative) stored in filp->f_owner.

例如,用户程序中的以下代码行启用对stdin输入文件的当前进程的异步通知:

For example, the following lines of code in a user program enable asynchronous notification to the current process for the stdin input file:

signal(SIGIO, &input_handler); /* dummy sample; sigaction(  ) is better */
fcntl(STDIN_FILENO, F_SETOWN, getpid(  ));
oflags = fcntl(STDIN_FILENO, F_GETFL);
fcntl(STDIN_FILENO, F_SETFL, oflags | FASYNC);

源代码中名为 asynctest 的程序是一个按上述方式读取 stdin 的简单程序,可以用来测试 scullpipe 的异步能力。该程序与 cat 类似,但不会在文件末尾终止;它只响应输入,而不响应没有输入的情况。

The program named asynctest in the sources is a simple program that reads stdin as shown. It can be used to test the asynchronous capabilities of scullpipe. The program is similar to cat but doesn't terminate on end-of-file; it responds only to input, not to the absence of input.

但请注意,并非所有设备都支持异步通知,您可以选择不提供它。应用程序通常假设异步功能仅适用于套接字和 tty。

Note, however, that not all the devices support asynchronous notification, and you can choose not to offer it. Applications usually assume that the asynchronous capability is available only for sockets and ttys.

输入通知还存在一个问题。当进程收到 SIGIO 时,它不知道哪个输入文件有新的输入要提供。如果有多个文件被启用来异步通知进程有待处理的输入,应用程序仍必须求助于 poll 或 select 来查明发生了什么。

There is one remaining problem with input notification. When a process receives a SIGIO, it doesn't know which input file has new input to offer. If more than one file is enabled to asynchronously notify the process of pending input, the application must still resort to poll or select to find out what happened.

驱动程序的观点

The Driver's Point of View

对我们来说更相关的主题是设备驱动程序如何实现异步信号发送。下面的列表详细介绍了从内核角度来看的操作顺序:

A more relevant topic for us is how the device driver can implement asynchronous signaling. The following list details the sequence of operations from the kernel's point of view:

  1. 当调用 F_SETOWN 时,除了给 filp->f_owner 赋值之外,不会发生任何事情。

  1. When F_SETOWN is invoked, nothing happens, except that a value is assigned to filp->f_owner.

  2. 当执行 F_SETFL 打开 FASYNC 时,会调用驱动程序的 fasync 方法。每当 filp->f_flags 中 FASYNC 的值发生变化时都会调用此方法,以便把变化通知驱动程序,使其能够正确响应。文件打开时该标志默认是清除的。我们将在本节后面介绍该驱动程序方法的标准实现。

  2. When F_SETFL is executed to turn on FASYNC, the driver's fasync method is called. This method is called whenever the value of FASYNC is changed in filp->f_flags to notify the driver of the change, so it can respond properly. The flag is cleared by default when the file is opened. We'll look at the standard implementation of the driver method later in this section.

  3. 当数据到达时,必须向所有注册了异步通知的进程发送 SIGIO 信号。

  3. When data arrives, all the processes registered for asynchronous notification must be sent a SIGIO signal.

虽然实现第一步很简单(驱动程序无需执行任何操作),但其他步骤涉及维护动态数据结构以跟踪不同的异步读取器;可能有几个。然而,这种动态数据结构并不依赖于所涉及的特定设备,并且内核提供了合适的通用实现,因此您不必在每个驱动程序中重写相同的代码。

While implementing the first step is trivial—there's nothing to do on the driver's part—the other steps involve maintaining a dynamic data structure to keep track of the different asynchronous readers; there might be several. This dynamic data structure, however, doesn't depend on the particular device involved, and the kernel offers a suitable general-purpose implementation so that you don't have to rewrite the same code in every driver.

Linux 提供的一般实现基于一种数据结构和两个函数(在前面描述的第二步和第三步中调用)。声明相关材料的头文件是<linux/fs.h> (这里没什么新内容),数据结构称为struct fasync_struct. 与等待队列一样,我们需要在设备特定的数据结构中插入一个指向该结构的指针。

The general implementation offered by Linux is based on one data structure and two functions (which are called in the second and third steps described earlier). The header that declares related material is <linux/fs.h> (nothing new here), and the data structure is called struct fasync_struct. As with wait queues, we need to insert a pointer to the structure in the device-specific data structure.

驱动程序调用的两个函数对应以下原型:

The two functions that the driver calls correspond to the following prototypes:

int fasync_helper(int fd, struct file *filp,
       int mode, struct fasync_struct **fa);
void kill_fasync(struct fasync_struct **fa, int sig, int band);

当打开文件的 FASYNC 标志发生变化时,调用 fasync_helper 以便在感兴趣的进程列表中添加或删除条目。除最后一个参数外,它的所有参数都由 fasync 方法提供,可以直接传递。当数据到达时,kill_fasync 用于向感兴趣的进程发出信号。它的参数是要发送的信号(通常是 SIGIO)和 band,后者几乎总是 POLL_IN [ 6 ](但也可以用于在网络代码中发送“紧急”或带外数据)。

fasync_helper is invoked to add or remove entries from the list of interested processes when the FASYNC flag changes for an open file. All of its arguments except the last are provided to the fasync method and can be passed through directly. kill_fasync is used to signal the interested processes when data arrives. Its arguments are the signal to send (usually SIGIO) and the band, which is almost always POLL_IN [6] (but that may be used to send "urgent" or out-of-band data in the networking code).

以下是scullpipe实现fasync 方法的方式:

Here's how scullpipe implements the fasync method:

static int scull_p_fasync(int fd, struct file *filp, int mode)
{
    struct scull_pipe *dev = filp->private_data;

    return fasync_helper(fd, filp, mode, &dev->async_queue);
}

很明显,所有工作都是由 fasync_helper 完成的。但是,如果驱动程序中没有这个方法,就不可能实现该功能,因为辅助函数需要访问指向 struct fasync_struct * 的正确指针(此处为 &dev->async_queue),而只有驱动程序才能提供这一信息。

It's clear that all the work is performed by fasync_helper. It wouldn't be possible, however, to implement the functionality without a method in the driver, because the helper function needs to access the correct pointer to struct fasync_struct * (here &dev->async_queue), and only the driver can provide the information.

当数据到达时,必须执行以下语句来向异步读取器发出信号。由于scullpipe读取器的新数据是由发出write 的进程生成的,因此该语句出现在 scullpipewrite方法中。

When data arrives, then, the following statement must be executed to signal asynchronous readers. Since new data for the scullpipe reader is generated by a process issuing a write, the statement appears in the write method of scullpipe.

if (dev->async_queue)
    kill_fasync(&dev->async_queue, SIGIO, POLL_IN);

请注意,某些设备还实现了异步通知来指示设备何时可写;当然,在这种情况下,必须以 POLL_OUT 模式调用 kill_fasync。

Note that some devices also implement asynchronous notification to indicate when the device can be written; in this case, of course, kill_fasync must be called with a mode of POLL_OUT.

看起来我们已经完成了,但还缺少一件事。当文件关闭时,我们必须调用 fasync 方法,以便从活动异步读取器列表中删除该文件。尽管仅当 filp->f_flags 中设置了 FASYNC 时才需要此调用,但无论如何调用该函数都不会造成损害,并且这也是通常的实现。例如,以下几行是 scullpipe 的 release 方法的一部分:

It might appear that we're done, but there's still one thing missing. We must invoke our fasync method when the file is closed to remove the file from the list of active asynchronous readers. Although this call is required only if filp->f_flags has FASYNC set, calling the function anyway doesn't hurt and is the usual implementation. The following lines, for example, are part of the release method for scullpipe:

/* remove this filp from the asynchronously notified filp's */
scull_p_fasync(-1, filp, 0);

异步通知底层的数据结构与 struct wait_queue 几乎相同,因为两种情况都涉及等待某个事件。区别在于这里用 struct file 代替了 struct task_struct。然后使用队列中的 struct file 来检索 f_owner,以便向进程发出信号。

The data structure underlying asynchronous notification is almost identical to the structure struct wait_queue, because both situations involve waiting on an event. The difference is that struct file is used in place of struct task_struct. The struct file in the queue is then used to retrieve f_owner, in order to signal the process.

寻求设备

Seeking a Device

本章最后需要介绍的内容之一是 llseek 方法,该方法(对于某些设备)很有用,并且易于实现。

One of the last things we need to cover in this chapter is the llseek method, which is useful (for some devices) and easy to implement.

llseek 的实现

The llseek Implementation

llseek 方法实现 lseek 和 llseek 系统调用。我们已经说过,如果设备操作中缺少 llseek 方法,内核中的默认实现会通过修改 filp->f_pos(文件中当前的读/写位置)来执行查找。请注意,为了使 lseek 系统调用正常工作,read 和 write 方法必须配合:使用并更新它们作为参数接收的偏移量。

The llseek method implements the lseek and llseek system calls. We have already stated that if the llseek method is missing from the device's operations, the default implementation in the kernel performs seeks by modifying filp->f_pos, the current reading/writing position within the file. Please note that for the lseek system call to work correctly, the read and write methods must cooperate by using and updating the offset item they receive as an argument.

如果查找操作对应于设备上的物理操作,您可能需要提供自己的llseek方法。在scull驱动中可以看到一个简单的例子:

You may need to provide your own llseek method if the seek operation corresponds to a physical operation on the device. A simple example can be seen in the scull driver:

loff_t scull_llseek(struct file *filp, loff_t off, int whence)
{
    struct scull_dev *dev = filp->private_data;
    loff_t newpos;

    switch(whence) {
      case 0: /* SEEK_SET */
        newpos = off;
        break;

      case 1: /* SEEK_CUR */
        newpos = filp->f_pos + off;
        break;

      case 2: /* SEEK_END */
        newpos = dev->size + off;
        break;

      default: /* can't happen */
        return -EINVAL;
    }
    if (newpos < 0) return -EINVAL;
    filp->f_pos = newpos;
    return newpos;
}

这里唯一特定于设备的操作是从设备检索文件长度。在scull中,read和write方法根据需要进行协作,如第 3 章 所示。

The only device-specific operation here is retrieving the file length from the device. In scull the read and write methods cooperate as needed, as shown in Chapter 3.

尽管刚刚展示的实现对于 scull 来说是有意义的(它处理一个定义明确的数据区域),但大多数设备提供的是数据流而不是数据区域(只需想想串行端口或键盘),对这些设备进行查找是没有意义的。如果您的设备属于这种情况,仅仅不声明 llseek 操作是不够的,因为默认方法允许查找。相反,您应该通过在 open 方法中调用 nonseekable_open 来通知内核您的设备不支持 llseek:

Although the implementation just shown makes sense for scull, which handles a well-defined data area, most devices offer a data flow rather than a data area (just think about the serial ports or the keyboard), and seeking those devices does not make sense. If this is the case for your device, you can't just refrain from declaring the llseek operation, because the default method allows seeking. Instead, you should inform the kernel that your device does not support llseek by calling nonseekable_open in your open method:

int nonseekable_open(struct inode *inode, struct file *filp);

这个调用将给定的 filp 标记为不可查找;内核决不允许对这样的文件的 lseek 调用成功。通过以这种方式标记文件,您还可以确保不会有人尝试通过 pread 和 pwrite 系统调用在文件中定位。

This call marks the given filp as being nonseekable; the kernel never allows an lseek call on such a file to succeed. By marking the file in this way, you can also be assured that no attempts will be made to seek the file by way of the pread and pwrite system calls.

为了完整起见,您还应该将 file_operations 结构中的 llseek 方法设置为特殊的辅助函数 no_llseek,该函数在 <linux/fs.h> 中定义。

For completeness, you should also set the llseek method in your file_operations structure to the special helper function no_llseek, which is defined in <linux/fs.h>.

设备文件的访问控制

Access Control on a Device File

提供访问控制有时对于设备节点的可靠性至关重要。不仅不允许未经授权的用户使用该设备(由文件系统权限位强制执行限制),而且有时一次只应允许一个授权用户打开该设备。

Offering access control is sometimes vital for the reliability of a device node. Not only should unauthorized users not be permitted to use the device (a restriction is enforced by the filesystem permission bits), but sometimes only one authorized user should be allowed to open the device at a time.

该问题与使用 tty 的问题类似。在那种情况下,每当用户登录系统时,login 进程都会更改设备节点的所有权,以防止其他用户干扰或嗅探 tty 数据流。但是,仅仅为了授予对设备的独占访问权,而在每次打开设备时都使用特权程序来更改其所有权,是不切实际的。

The problem is similar to that of using ttys. In that case, the login process changes the ownership of the device node whenever a user logs into the system, in order to prevent other users from interfering with or sniffing the tty data flow. However, it's impractical to use a privileged program to change the ownership of a device every time it is opened just to grant unique access to it.

到目前为止显示的代码都没有实现超出文件系统权限位的任何访问控制。如果open系统调用将请求转发给驱动程序,则open成功。我们现在介绍一些用于实施一些附加检查的技术。

None of the code shown up to now implements any access control beyond the filesystem permission bits. If the open system call forwards the request to the driver, open succeeds. We now introduce a few techniques for implementing some additional checks.

本节中显示的每个设备都具有与裸scull设备相同的行为 (即,它实现了持久内存区域),但在访问控制方面 与scull不同,访问控制是在打开释放操作中实现的。

Every device shown in this section has the same behavior as the bare scull device (that is, it implements a persistent memory area) but differs from scull in access control, which is implemented in the open and release operations.

单开设备

Single-Open Devices

提供访问控制的强力方法是允许设备一次仅由一个进程打开(单次打开)。最好避免这种技术,因为它会抑制用户的创造力。用户可能希望在同一设备上运行不同的进程,一个进程读取状态信息,另一个进程写入数据。在某些情况下,用户可以通过 shell 脚本运行几个简单的程序来完成很多工作,只要他们可以同时访问设备。换句话说,实现单一打开行为相当于创建策略,这可能会妨碍用户想要执行的操作。

The brute-force way to provide access control is to permit a device to be opened by only one process at a time (single openness). This technique is best avoided because it inhibits user ingenuity. A user might want to run different processes on the same device, one reading status information while the other is writing data. In some cases, users can get a lot done by running a few simple programs through a shell script, as long as they can access the device concurrently. In other words, implementing a single-open behavior amounts to creating policy, which may get in the way of what your users want to do.

仅允许单个进程打开设备具有不良属性,但它也是设备驱动程序实现的最简单的访问控制,因此此处显示。源代码是从名为scullsingle 的设备中提取的。

Allowing only a single process to open a device has undesirable properties, but it is also the easiest access control to implement for a device driver, so it's shown here. The source code is extracted from a device called scullsingle.

scullsingle 设备维护一个名为 scull_s_available 的 atomic_t 变量;该变量被初始化为 1,表明该设备确实可用。open 调用会递减并测试 scull_s_available,如果其他人已经打开了设备,则拒绝访问:

The scullsingle device maintains an atomic_t variable called scull_s_available; that variable is initialized to a value of one, indicating that the device is indeed available. The open call decrements and tests scull_s_available and refuses access if somebody else already has the device open:

static atomic_t scull_s_available = ATOMIC_INIT(1);

static int scull_s_open(struct inode *inode, struct file *filp)
{
    struct scull_dev *dev = &scull_s_device; /* device information */

    if (!atomic_dec_and_test (&scull_s_available)) {
        atomic_inc(&scull_s_available);
        return -EBUSY; /* already open */
    }

    /* then, everything else is copied from the bare scull device */
    if ( (filp->f_flags & O_ACCMODE) == O_WRONLY)
        scull_trim(dev);
    filp->private_data = dev;
    return 0;          /* success */
}

另一方面,release调用 将设备标记为不再忙碌:

The release call, on the other hand, marks the device as no longer busy:

static int scull_s_release(struct inode *inode, struct file *filp)
{
    atomic_inc(&scull_s_available); /* release the device */
    return 0;
}

通常,我们建议您将打开标志 scull_s_available 放在设备结构中(此处为 Scull_Dev),因为从概念上讲,它属于设备。然而,scull 驱动程序使用独立变量来保存该标志,这样它就可以使用与裸 scull 设备相同的设备结构和方法,并最大限度地减少代码重复。

Normally, we recommend that you put the open flag scull_s_available within the device structure (Scull_Dev here) because, conceptually, it belongs to the device. The scull driver, however, uses standalone variables to hold the flag so it can use the same device structure and methods as the bare scull device and minimize code duplication.

一次限制单个用户的访问

Restricting Access to a Single User at a Time

超越单开设备的下一步,是让单个用户可以在多个进程中打开一个设备,但一次只允许一个用户打开该设备。该解决方案使测试设备变得容易,因为用户可以同时从多个进程读取和写入,但前提是用户要为多次访问期间的数据完整性承担一定责任。这是通过在 open 方法中添加检查来实现的;此类检查在正常的权限检查之后执行,并且只能使访问比所有者和组权限位所指定的更具限制性。这与用于 tty 的访问策略相同,但它不依赖外部特权程序。

The next step beyond a single-open device is to let a single user open a device in multiple processes but allow only one user to have the device open at a time. This solution makes it easy to test the device, since the user can read and write from several processes at once, but assumes that the user takes some responsibility for maintaining the integrity of the data during multiple accesses. This is accomplished by adding checks in the open method; such checks are performed after the normal permission checking and can only make access more restrictive than that specified by the owner and group permission bits. This is the same access policy as that used for ttys, but it doesn't resort to an external privileged program.

这些访问策略的实现比单开策略要复杂一些。在这种情况下,需要两个数据项:打开计数和设备“所有者”的 uid。再次强调,此类数据项的最佳位置是在设备结构内;我们的示例出于之前为 scullsingle 解释过的原因,改用全局变量。该设备的名称是 sculluid。

Those access policies are a little trickier to implement than single-open policies. In this case, two items are needed: an open count and the uid of the "owner" of the device. Once again, the best place for such items is within the device structure; our example uses global variables instead, for the reason explained earlier for scullsingle. The name of the device is sculluid.

open调用在第一次打开时授予访问权限,但会记住设备的所有者。这意味着用户可以多次打开设备,从而允许协作进程在设备上同时工作。同时,其他用户也无法打开它,从而避免了外部干扰。由于此版本的功能与前一版本几乎相同,因此此处仅复制相关部分:

The open call grants access on first open but remembers the owner of the device. This means that a user can open the device multiple times, thus allowing cooperating processes to work concurrently on the device. At the same time, no other user can open it, thus avoiding external interference. Since this version of the function is almost identical to the preceding one, only the relevant part is reproduced here:

    spin_lock(&scull_u_lock);
    if (scull_u_count && 
            (scull_u_owner != current->uid) &&  /* allow user */
            (scull_u_owner != current->euid) && /* allow whoever did su */
            !capable(CAP_DAC_OVERRIDE)) { /* still allow root */
        spin_unlock(&scull_u_lock);
        return -EBUSY;   /* -EPERM would confuse the user */
    }

    if (scull_u_count == 0)
        scull_u_owner = current->uid; /* grab it */

    scull_u_count++;
    spin_unlock(&scull_u_lock);

请注意,sculluid 代码有两个变量(scull_u_owner 和 scull_u_count)控制着对设备的访问,并且可能被多个进程并发访问。为了保证这些变量的安全,我们使用自旋锁 scull_u_lock 来控制对它们的访问。如果没有这种锁定,两个(或更多)进程可能同时测试 scull_u_count,并且都得出自己有权取得设备所有权的结论。这里适合使用自旋锁,因为锁的持有时间很短,并且驱动程序在持有锁时不会做任何可能休眠的事情。

Note that the sculluid code has two variables (scull_u_owner and scull_u_count) that control access to the device and that could be accessed concurrently by multiple processes. To make these variables safe, we control access to them with a spinlock (scull_u_lock). Without that locking, two (or more) processes could test scull_u_count at the same time, and both could conclude that they were entitled to take ownership of the device. A spinlock is indicated here, because the lock is held for a very short time, and the driver does nothing that could sleep while holding the lock.
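The same check-and-grab pattern can be sketched in ordinary user space. The following is a hypothetical analog, with a pthread mutex standing in for the spinlock; the names (try_open, do_release) are invented for illustration and are not part of the scull driver.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical user-space analog of the sculluid open-time check:
 * a mutex protects the owner/count pair, as scull_u_lock does. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int owner = -1;   /* -1 means "no owner yet" */
static int count = 0;

/* Returns 0 on success, -1 if another uid already owns the resource
 * (the driver would return -EBUSY here). */
static int try_open(int uid)
{
    int ret = 0;

    pthread_mutex_lock(&lock);
    if (count && owner != uid) {
        ret = -1;               /* busy, owned by someone else */
    } else {
        if (count == 0)
            owner = uid;        /* first opener grabs ownership */
        count++;
    }
    pthread_mutex_unlock(&lock);
    return ret;
}

static void do_release(void)
{
    pthread_mutex_lock(&lock);
    count--;                    /* nothing else, as in scull_u_release */
    pthread_mutex_unlock(&lock);
}
```

Without the lock, two callers could both see `count == 0` and both claim ownership, which is exactly the race the driver's spinlock prevents.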

尽管这段代码执行的是权限检查,我们还是选择返回 -EBUSY 而不是 -EPERM,以便为被拒绝访问的用户指明正确的方向。对“Permission denied”的反应通常是检查 /dev 文件的模式和所有者,而“Device busy”则正确地提示用户应该去查找已经在使用该设备的进程。

We chose to return -EBUSY and not -EPERM, even though the code is performing a permission check, in order to point a user who is denied access in the right direction. The reaction to "Permission denied" is usually to check the mode and owner of the /dev file, while "Device busy" correctly suggests that the user should look for a process already using the device.

此代码还会检查尝试打开的进程是否有能力覆盖文件访问权限;如果有,即使打开进程不是设备的所有者,也允许打开。在这种情况下,CAP_DAC_OVERRIDE 能力非常适合这项任务。

This code also checks to see if the process attempting the open has the ability to override file access permissions; if so, the open is allowed even if the opening process is not the owner of the device. The CAP_DAC_OVERRIDE capability fits the task well in this case.

释放方法如下所示:

The release method looks like the following:

static int scull_u_release(struct inode *inode, struct file *filp)
{
    spin_lock(&scull_u_lock);
    scull_u_count--; /* nothing else */
    spin_unlock(&scull_u_lock);
    return 0;
}

再一次,我们必须在修改计数之前获得锁,以确保我们不会与另一个进程竞争。

Once again, we must obtain the lock prior to modifying the count to ensure that we do not race with another process.

阻塞打开作为 EBUSY 的替代方案

Blocking open as an Alternative to EBUSY

当设备不可访问时,返回错误通常是最明智的方法,但在某些情况下,用户更愿意等待设备。

When the device isn't accessible, returning an error is usually the most sensible approach, but there are situations in which the user would prefer to wait for the device.

例如,如果一个数据通信通道既用于定期、按计划地传输报告(使用 crontab),又供人们按需临时使用,那么让计划的操作稍微延迟,要比仅仅因为通道当前正忙就让它失败好得多。

For example, if a data communication channel is used both to transmit reports on a regular, scheduled basis (using crontab) and for casual usage according to people's needs, it's much better for the scheduled operation to be slightly delayed rather than fail just because the channel is currently busy.

这是程序员在设计设备驱动程序时必须做出的选择之一,正确的答案取决于要解决的特定问题。

This is one of the choices that the programmer must make when designing a device driver, and the right answer depends on the particular problem being solved.

正如您可能已经猜到的,EBUSY 的替代方案是实现阻塞式 open。scullwuid 设备是 sculluid 的一个版本,它在 open 时等待设备,而不是返回 -EBUSY。它与 sculluid 的区别仅在于打开操作的以下部分:

The alternative to EBUSY, as you may have guessed, is to implement blocking open. The scullwuid device is a version of sculluid that waits for the device on open instead of returning -EBUSY. It differs from sculluid only in the following part of the open operation:

spin_lock(&scull_w_lock);
while (!scull_w_available( )) {
    spin_unlock(&scull_w_lock);
    if (filp->f_flags & O_NONBLOCK) return -EAGAIN;
    if (wait_event_interruptible (scull_w_wait, scull_w_available( )))
        return -ERESTARTSYS; /* tell the fs layer to handle it */
    spin_lock(&scull_w_lock);
}
if (scull_w_count == 0)
    scull_w_owner = current->uid; /* grab it */
scull_w_count++;
spin_unlock(&scull_w_lock);

该实现再次基于等待队列。如果该设备当前不可用,则尝试打开该设备的进程将被放置在等待队列中,直到拥有该设备的进程关闭该设备。

The implementation is based once again on a wait queue. If the device is not currently available, the process attempting to open it is placed on the wait queue until the owning process closes the device.

那么, release 方法负责唤醒任何挂起的进程:

The release method, then, is in charge of awakening any pending process:

static int scull_w_release(struct inode *inode, struct file *filp)
{
    int temp;

    spin_lock(&scull_w_lock);
    scull_w_count--;
    temp = scull_w_count;
    spin_unlock(&scull_w_lock);

    if (temp == 0)
        wake_up_interruptible_sync(&scull_w_wait); /* awake other uid's */
    return 0;
}

以下是调用wake_up_interruptible_sync 有意义的示例。当我们进行唤醒时,我们即将返回用户空间,这是系统的自然调度点。与其在唤醒时重新安排时间,不如直接调用“同步”版本并完成我们的工作。

Here is an example of where calling wake_up_interruptible_sync makes sense. When we do the wakeup, we are just about to return to user space, which is a natural scheduling point for the system. Rather than potentially reschedule when we do the wakeup, it is better to just call the "sync" version and finish our job.
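As a rough user-space analogy (not kernel code), the wait-until-released behavior of scullwuid can be sketched with a POSIX condition variable standing in for the wait queue; all names below are invented for illustration, and the single-open simplification is an assumption of the sketch.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical user-space sketch of the blocking-open pattern:
 * a would-be opener sleeps on a condition variable until the
 * holder releases, much as wait_event/wake_up do in the driver. */
static pthread_mutex_t w_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  w_wait = PTHREAD_COND_INITIALIZER;
static int w_count = 0;

static void blocking_open(void)
{
    pthread_mutex_lock(&w_lock);
    while (w_count > 0)                  /* "device" busy: sleep */
        pthread_cond_wait(&w_wait, &w_lock);
    w_count++;                           /* grab it */
    pthread_mutex_unlock(&w_lock);
}

static void release_and_wake(void)
{
    pthread_mutex_lock(&w_lock);
    if (--w_count == 0)
        pthread_cond_broadcast(&w_wait); /* awaken pending openers */
    pthread_mutex_unlock(&w_lock);
}

static void *waiter(void *arg)
{
    blocking_open();                     /* blocks until released */
    return arg;
}
```

Note the condition is re-tested in a loop after each wakeup, just as `wait_event_interruptible` re-evaluates its condition; a woken waiter may find the device grabbed again and must go back to sleep.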

阻塞式 open 实现的问题在于,对交互式用户来说它实在令人不快:用户不得不不断猜测哪里出了问题。交互式用户通常调用 cp 和 tar 之类的标准命令,无法简单地在 open 调用中加上 O_NONBLOCK。正在用隔壁房间的磁带机做备份的人,更愿意收到一条直白的“device or resource busy”消息,而不是被丢在那里猜测:tar 本应在扫描硬盘,为什么硬盘今天如此安静。

The problem with a blocking-open implementation is that it is really unpleasant for the interactive user, who has to keep guessing what is going wrong. The interactive user usually invokes standard commands, such as cp and tar, and can't just add O_NONBLOCK to the open call. Someone who's making a backup using the tape drive in the next room would prefer to get a plain "device or resource busy" message instead of being left to guess why the hard drive is so silent today, while tar should be scanning it.

此类问题(同一设备需要不同的、不兼容的策略)通常最好的解决方法是为每个访问策略实现一个设备节点。这种做法的一个例子可以在 Linux 磁带驱动程序中找到,它为同一设备提供多个设备文件。例如,不同的设备文件将导致驱动器在压缩或不压缩的情况下进行记录,或者在设备关闭时自动倒带。

This kind of problem (a need for different, incompatible policies for the same device) is often best solved by implementing one device node for each access policy. An example of this practice can be found in the Linux tape driver, which provides multiple device files for the same device. Different device files will, for example, cause the drive to record with or without compression, or to automatically rewind the tape when the device is closed.

在打开时克隆设备

Cloning the Device on open

另一种管理访问控制的技术,是根据打开设备的进程,创建设备的不同私有副本。

Another technique to manage access control is to create different private copies of the device, depending on the process opening it.

显然,只有当设备未绑定到硬件对象时,这才是可能的。 scull就是这种“软件”设备的一个例子。/dev/tty的内部结构 使用类似的技术,以便为其进程提供/dev入口点所代表内容的不同“视图”。当软件驱动程序创建设备的副本时,我们将其称为虚拟设备— 就像虚拟控制台使用单个物理 tty 设备一样。

Clearly, this is possible only if the device is not bound to a hardware object; scull is an example of such a "software" device. The internals of /dev/tty use a similar technique in order to give its process a different "view" of what the /dev entry point represents. When copies of the device are created by the software driver, we call them virtual devices—just as virtual consoles use a single physical tty device.

尽管很少需要这种访问控制,但其实现可以启发性地展示内核代码如何轻松地改变应用程序对周围世界(即计算机)的看法。

Although this kind of access control is rarely needed, the implementation can be enlightening in showing how easily kernel code can change the application's perspective of the surrounding world (i.e., the computer).

/dev/scullpriv 设备节点在 scull 包中实现了虚拟设备。scullpriv 的实现使用进程控制 tty 的设备号作为访问虚拟设备的键。尽管如此,您可以轻松修改源码,使用任何整数值作为键;每种选择都对应一种不同的策略。例如,使用 uid 会为每个用户提供一个不同的虚拟设备,而使用 pid 作为键则会为每个访问该设备的进程创建一个新设备。

The /dev/scullpriv device node implements virtual devices within the scull package. The scullpriv implementation uses the device number of the process's controlling tty as a key to access the virtual device. Nonetheless, you can easily modify the sources to use any integer value for the key; each choice leads to a different policy. For example, using the uid leads to a different virtual device for each user, while using a pid key creates a new device for each process accessing it.
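The lookup-or-create step that backs this per-key policy can be illustrated outside the kernel. The sketch below is a hypothetical user-space version keyed by a plain int, using a hand-rolled singly linked list instead of the kernel's list_head machinery; the names are invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space sketch of the scullpriv idea: each integer
 * key (tty device number, uid, pid...) maps to its own private
 * "device", created on first use. */
struct item {
    int key;
    char data[32];       /* the per-key private state */
    struct item *next;
};

static struct item *head;

/* Look for a device by key, creating one if missing, as
 * scull_c_lookfor_device does with the kernel list macros. */
static struct item *lookfor(int key)
{
    struct item *p;

    for (p = head; p; p = p->next)
        if (p->key == key)
            return p;

    /* not found: allocate, zero, and link a new one */
    p = calloc(1, sizeof(*p));
    if (!p)
        return NULL;
    p->key = key;
    p->next = head;
    head = p;
    return p;
}
```

In the driver the same walk happens under scull_c_lock, so that two processes with the same key cannot both create a device for it.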

使用控制终端的决定是为了使用 I/O 重定向轻松测试设备:该设备由同一虚拟终端上运行的所有命令共享,并与另一终端上运行的命令所看到的设备分开。

The decision to use the controlling terminal is meant to enable easy testing of the device using I/O redirection: the device is shared by all commands run on the same virtual terminal and is kept separate from the one seen by commands run on another terminal.

open方法类似于以下代码。它必须寻找正确的虚拟设备并可能创建一个。函数的最后部分没有显示,因为它是从我们已经看到的裸scull复制的。

The open method looks like the following code. It must look for the right virtual device and possibly create one. The final part of the function is not shown because it is copied from the bare scull, which we've already seen.

/* The clone-specific data structure includes a key field */

struct scull_listitem {
    struct scull_dev device;
    dev_t key;
    struct list_head list;
    
};

/* The list of devices, and a lock to protect it */
static LIST_HEAD(scull_c_list);
static spinlock_t scull_c_lock = SPIN_LOCK_UNLOCKED;

/* Look for a device or create one if missing */
static struct scull_dev *scull_c_lookfor_device(dev_t key)
{
    struct scull_listitem *lptr;

    list_for_each_entry(lptr, &scull_c_list, list) {
        if (lptr->key == key)
            return &(lptr->device);
    }

    /* not found */
    lptr = kmalloc(sizeof(struct scull_listitem), GFP_KERNEL);
    if (!lptr)
        return NULL;

    /* initialize the device */
    memset(lptr, 0, sizeof(struct scull_listitem));
    lptr->key = key;
    scull_trim(&(lptr->device)); /* initialize it */
    init_MUTEX(&(lptr->device.sem));

    /* place it in the list */
    list_add(&lptr->list, &scull_c_list);

    return &(lptr->device);
}

static int scull_c_open(struct inode *inode, struct file *filp)
{
    struct scull_dev *dev;
    dev_t key;
 
    if (!current->signal->tty) { 
        PDEBUG("Process \"%s\" has no ctl tty\n", current->comm);
        return -EINVAL;
    }
    key = tty_devnum(current->signal->tty);

    /* look for a scullc device in the list */
    spin_lock(&scull_c_lock);
    dev = scull_c_lookfor_device(key);
    spin_unlock(&scull_c_lock);

    if (!dev)
        return -ENOMEM;

    /* then, everything else is copied from the bare scull device */

release 方法没有什么特别之处。它通常应在最后一次关闭时释放设备,但为了简化驱动程序的测试,我们选择不维护打开计数。如果设备在最后一次关闭时被释放,那么在写入设备之后您将无法再读回同样的数据,除非有后台进程一直保持设备打开。示例驱动程序采用了更简单的做法,即保留数据,这样下一次打开时数据仍在那里。设备会在调用 scull_cleanup 时被释放。

The release method does nothing special. It would normally release the device on last close, but we chose not to maintain an open count in order to simplify the testing of the driver. If the device were released on last close, you wouldn't be able to read the same data after writing to the device, unless a background process were to keep it open. The sample driver takes the easier approach of keeping the data, so that at the next open, you'll find it there. The devices are released when scull_cleanup is called.

此代码优先使用通用 Linux 链表机制,而不是从头开始重新实现相同的功能。Linux 列表将在第 11 章中讨论。

This code uses the generic Linux linked list mechanism in preference to reimplementing the same capability from scratch. Linux lists are discussed in Chapter 11.

这是/dev/scullpriv发布实现,它结束了对设备方法的讨论。

Here's the release implementation for /dev/scullpriv, which closes the discussion of device methods.

static int scull_c_release(struct inode *inode, struct file *filp)
{
    /*
     * Nothing to do, because the device is persistent.
     * A `real' cloned device should be freed on last close
     */
    return 0;



}

快速参考

Quick Reference

本章介绍了以下符号和头文件:

This chapter introduced the following symbols and header files:

#include <linux/ioctl.h>

声明用于定义ioctl命令的所有宏 。它当前包含在<linux/fs.h>中。

Declares all the macros used to define ioctl commands. It is currently included by <linux/fs.h>.

_IOC_NRBITS

_IOC_TYPEBITS

_IOC_SIZEBITS

_IOC_DIRBITS

ioctl命令的不同位字段可用的位数。还有四个指定MASKs 的宏和四个指定SHIFTs 的宏,但它们主要供内部使用。_IOC_SIZEBITS是一个需要检查的重要值,因为它会随着架构的不同而变化。

The number of bits available for the different bitfields of ioctl commands. There are also four macros that specify the MASKs and four that specify the SHIFTs, but they're mainly for internal use. _IOC_SIZEBITS is an important value to check, because it changes across architectures.

_IOC_NONE

_IOC_READ

_IOC_WRITE

“方向”位字段的可能值。“读”和“写”是不同的位,可以通过“或”来指定读/写。这些值从 0 开始。

The possible values for the "direction" bitfield. "Read" and "write" are different bits and can be ORed to specify read/write. The values are 0-based.

_IOC(dir,type,nr,size)

_IO(type,nr)

_IOR(type,nr,size)

_IOW(type,nr,size)

_IOWR(type,nr,size)

用于创建 ioctl 命令的宏。

Macros used to create an ioctl command.

_IOC_DIR(nr)

_IOC_TYPE(nr)

_IOC_NR(nr)

_IOC_SIZE(nr)

用于解码命令的宏。特别地,_IOC_TYPE(nr) 是 _IOC_READ 和 _IOC_WRITE 的 OR 组合。

Macros used to decode a command. In particular, _IOC_TYPE(nr) is an OR combination of _IOC_READ and _IOC_WRITE.

#include <asm/uaccess.h>

int access_ok(int type, const void *addr, unsigned long size);

检查指向用户空间的指针是否确实可用。 如果应该允许访问,access_ok将返回一个非零值。

Checks that a pointer to user space is actually usable. access_ok returns a nonzero value if the access should be allowed.

VERIFY_READ

VERIFY_WRITE

access_ok 中 type 参数的可能值。VERIFY_WRITE 是 VERIFY_READ 的超集。

The possible values for the type argument in access_ok. VERIFY_WRITE is a superset of VERIFY_READ.

#include <asm/uaccess.h>

int put_user(datum,ptr);

int get_user(local,ptr);

int __put_user(datum,ptr);

int __get_user(local,ptr);

用于在用户空间存入或取出数据的宏。传输的字节数取决于 sizeof(*ptr)。常规版本会先调用 access_ok,而带下划线前缀的版本(__put_user 和 __get_user)假定 access_ok 已经被调用过。

Macros used to store or retrieve a datum to or from user space. The number of bytes being transferred depends on sizeof(*ptr). The regular versions call access_ok first, while the qualified versions (__put_user and __get_user) assume that access_ok has already been called.

#include <linux/capability.h>

定义了描述用户空间进程可能具有的能力的各种 CAP_ 符号。

Defines the various CAP_ symbols describing the capabilities a user-space process may have.

int capable(int capability);

如果进程具有给定的能力,则返回非零值。

Returns nonzero if the process has the given capability.

#include <linux/wait.h>

typedef struct { /* ... */ } wait_queue_head_t;

void init_waitqueue_head(wait_queue_head_t *queue);

DECLARE_WAIT_QUEUE_HEAD(queue);

Linux 等待队列的定义类型。wait_queue_head_t 必须显式初始化:在运行时使用 init_waitqueue_head,或在编译时使用 DECLARE_WAIT_QUEUE_HEAD。

The defined type for Linux wait queues. A wait_queue_head_t must be explicitly initialized with either init_waitqueue_head at runtime or DECLARE_WAIT_QUEUE_HEAD at compile time.

void wait_event(wait_queue_head_t q, int condition);

int wait_event_interruptible(wait_queue_head_t q, int condition);

int wait_event_timeout(wait_queue_head_t q, int condition, int time);

int wait_event_interruptible_timeout(wait_queue_head_t q, int condition, int time);

使进程在给定队列上休眠,直到给定的condition计算结果为真值。

Cause the process to sleep on the given queue until the given condition evaluates to a true value.

void wake_up(struct wait_queue **q);

void wake_up_interruptible(struct wait_queue **q);

void wake_up_nr(struct wait_queue **q, int nr);

void wake_up_interruptible_nr(struct wait_queue **q, int nr);

void wake_up_all(struct wait_queue **q);

void wake_up_interruptible_all(struct wait_queue **q);

void wake_up_interruptible_sync(struct wait_queue **q);

唤醒在队列 q 上休眠的进程。_interruptible 形式仅唤醒可中断的进程。通常只有一个独占等待者被唤醒,但可以通过 _nr 或 _all 形式改变该行为。_sync 版本在返回之前不会重新调度 CPU。

Wake processes that are sleeping on the queue q. The _interruptible form wakes only interruptible processes. Normally, only one exclusive waiter is awakened, but that behavior can be changed with the _nr or _all forms. The _sync version does not reschedule the CPU before returning.

#include <linux/sched.h>

set_current_state(int state);

设置当前进程的执行状态。TASK_RUNNING表示已准备好运行,而睡眠状态为 TASK_INTERRUPTIBLETASK_UNINTERRUPTIBLE

Sets the execution state of the current process. TASK_RUNNING means it is ready to run, while the sleep states are TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE.

void schedule(void);

从运行队列中选择一个可运行的进程。所选择的进程可以是 current,也可以是另一个进程。

Selects a runnable process from the run queue. The chosen process can be current or a different one.

typedef struct { /* ... */ } wait_queue_t;

init_waitqueue_entry(wait_queue_t *entry, struct task_struct *task);

wait_queue_t类型用于将进程放入等待队列。

The wait_queue_t type is used to place a process onto a wait queue.

void prepare_to_wait(wait_queue_head_t *queue, wait_queue_t *wait, int state);

void prepare_to_wait_exclusive(wait_queue_head_t *queue, wait_queue_t *wait, int state);

void finish_wait(wait_queue_head_t *queue, wait_queue_t *wait);

可用于编写手动睡眠代码的辅助函数。

Helper functions that can be used to code a manual sleep.

void sleep_on(wait_queue_head_t *queue);

void interruptible_sleep_on(wait_queue_head_t *queue);

无条件使当前进程进入睡眠状态的已过时和已弃用的函数。

Obsolete and deprecated functions that unconditionally put the current process to sleep.

#include <linux/poll.h>

void poll_wait(struct file *filp, wait_queue_head_t *q, poll_table *p)

将当前进程放入等待队列,而不立即调度。它被设计为供设备驱动程序的poll方法使用。

Places the current process into a wait queue without scheduling immediately. It is designed to be used by the poll method of device drivers.

int fasync_helper(struct inode *inode, struct file *filp, int mode, struct fasync_struct **fa);

用于实现 fasync 设备方法的“助手”。mode 参数与传递给该方法的值相同,而 fa 指向设备特定的 fasync_struct *。

A "helper" for implementing the fasync device method. The mode argument is the same value that is passed to the method, while fa points to a device-specific fasync_struct *.

void kill_fasync(struct fasync_struct *fa, int sig, int band);

如果驱动程序支持异步通知,则可以使用此函数向在 fa 中注册的进程发送信号。

If the driver supports asynchronous notification, this function can be used to send a signal to processes registered in fa.

int nonseekable_open(struct inode *inode, struct file *filp);

loff_t no_llseek(struct file *file, loff_t offset, int whence);

任何不支持定位(seek)的设备都应在其 open 方法中调用 nonseekable_open。此类设备还应使用 no_llseek 作为其 llseek 方法。

nonseekable_open should be called in the open method of any device that does not support seeking. Such devices should also use no_llseek as their llseek method.




[ 1 ]然而,最近对该文件的维护有些稀缺。

[1] Maintenance of this file has been somewhat scarce as of late, however.

[ 2 ]实际上,当前使用的所有libc实现(包括 uClibc)仅将 -4095 到 -1 范围内的值视为错误代码。不幸的是,能够返回大负数但不能返回小负数并不是很有用。

[2] Actually, all libc implementations currently in use (including uClibc) consider as error codes only values in the range -4095 to -1. Unfortunately, being able to return large negative numbers but not small ones is not very useful.

[ 3 ]它有一个富有想象力的名称default_wake_function

[3] It has the imaginative name default_wake_function.

[ 4 ]实际上,epoll是一组三个调用,它们一起可以用来实现轮询功能。不过,出于我们的目的,我们可以将其视为单个调用。

[4] Actually, epoll is a set of three calls that together can be used to achieve the polling functionality. For our purposes, though, we can think of it as a single call.

[ 5 ]该函数为将来调用epoll_wait设置内部数据结构 。

[5] This is the function that sets up the internal data structure for future calls to epoll_wait.

[ 6 ]POLL_IN是异步通知代码中使用的符号;它相当于POLLIN|POLLRDNORM.

[6] POLL_IN is a symbol used in the asynchronous notification code; it is equivalent to POLLIN|POLLRDNORM.

第 7 章时间、延误和延期工作

Chapter 7. Time, Delays, and Deferred Work

至此,我们已经了解了如何编写功能齐全的 char 模块的基础知识。然而,现实世界的驱动程序需要做的不仅仅是实现控制设备的操作;他们必须处理诸如计时、内存管理、硬件访问等问题。幸运的是,内核导出了许多工具来减轻驱动程序编写者的任务。在接下来的几章中,我们将描述一些可以使用的内核资源。本章通过描述如何解决时序问题来引导。处理时间涉及以下任务(按复杂程度递增):

At this point, we know the basics of how to write a full-featured char module. Real-world drivers, however, need to do more than implement the operations that control a device; they have to deal with issues such as timing, memory management, hardware access, and more. Fortunately, the kernel exports a number of facilities to ease the task of the driver writer. In the next few chapters, we'll describe some of the kernel resources you can use. This chapter leads the way by describing how timing issues are addressed. Dealing with time involves the following tasks, in order of increasing complexity:

  • 测量时间流逝并比较时间

  • Measuring time lapses and comparing times

  • 知道当前时间

  • Knowing the current time

  • 将操作延迟指定的时间

  • Delaying operation for a specified amount of time

  • 安排异步函数稍后发生

  • Scheduling asynchronous functions to happen at a later time

测量时间流逝

Measuring Time Lapses

内核跟踪 通过定时器中断来控制时间的流动。第 10 章详细介绍了中断。

The kernel keeps track of the flow of time by means of timer interrupts. Interrupts are covered in detail in Chapter 10.

定时器中断是由产生的 系统的定时硬件定期;该间隔在引导时由内核根据 的值进行编程,该值是<linux/param.h>或其包含的子平台文件HZ中定义的与体系结构相关的值。分布式内核源代码中的默认值在真实硬件上为每秒 50 到 1200 个刻度,而在软件模拟器上则为 24 个刻度。大多数平台以每秒 100 或 1000 个中断的速度运行;流行的 x86 PC 默认为 1000,尽管在以前的版本(直到并包括 2.4)中它曾经是 100。作为一般规则,即使您知道 的值,在编程时也不应该依赖该特定值。HZ

Timer interrupts are generated by the system's timing hardware at regular intervals; this interval is programmed at boot time by the kernel according to the value of HZ, which is an architecture-dependent value defined in <linux/param.h> or a subplatform file included by it. Default values in the distributed kernel source range from 50 to 1200 ticks per second on real hardware, down to 24 for software simulators. Most platforms run at 100 or 1000 interrupts per second; the popular x86 PC defaults to 1000, although it used to be 100 in previous versions (up to and including 2.4). As a general rule, even if you know the value of HZ, you should never count on that specific value when programming.

HZ对于那些想要具有不同时钟中断频率的系统的人来说,可以更改 的值。如果更改HZ头文件,则需要使用新值重新编译内核和所有模块。HZ如果您愿意支付额外的计时器中断的开销来实现您的目标,您可能希望提高异步任务的粒度,以获得更细粒度的解决方案。实际上,HZ对于使用 2.4 或 2.2 版内核的 x86 工业系统,提升到 1000 是很常见的。然而,在当前版本中,定时器中断的最佳方法是保留默认值HZ,凭借我们对内核开发人员的完全信任,他们当然选择了最有价值的。HZ此外,一些内部计算目前仅在 12 到 1535 的范围内实现(参见<linux/timex.h>和 RFC-1589)。

It is possible to change the value of HZ for those who want systems with a different clock interrupt frequency. If you change HZ in the header file, you need to recompile the kernel and all modules with the new value. You might want to raise HZ to get a more fine-grained resolution in your asynchronous tasks, if you are willing to pay the overhead of the extra timer interrupts to achieve your goals. Actually, raising HZ to 1000 was pretty common with x86 industrial systems using Version 2.4 or 2.2 of the kernel. With current versions, however, the best approach to the timer interrupt is to keep the default value for HZ, by virtue of our complete trust in the kernel developers, who have certainly chosen the best value. Besides, some internal calculations are currently implemented only for HZ in the range from 12 to 1535 (see <linux/timex.h> and RFC-1589).

每次发生定时器中断时,内部内核计数器的值都会递增。该计数器在系统启动时初始化0,因此它代表自上次启动以来的时钟滴答数。计数器是一个 64 位变量(即使在 32 位体系结构上),称为jiffies_64。然而,驱动程序编写者通常访问该jiffies变量,unsigned long该变量与其最低有效位相同jiffies_64。使用jiffies通常是首选,因为它速度更快,并且对 64 位 jiffies_64值的访问不一定在所有体系结构上都是原子的。

Every time a timer interrupt occurs, the value of an internal kernel counter is incremented. The counter is initialized to 0 at system boot, so it represents the number of clock ticks since last boot. The counter is a 64-bit variable (even on 32-bit architectures) and is called jiffies_64. However, driver writers normally access the jiffies variable, an unsigned long that is the same as either jiffies_64 or its least significant bits. Using jiffies is usually preferred because it is faster, and accesses to the 64-bit jiffies_64 value are not necessarily atomic on all architectures.

除了低分辨率内核管理的 jiffy 机制之外,某些 CPU 平台还具有软件可以读取的高分辨率计数器。尽管它的实际使用在不同平台上有所不同,但有时它是一个非常强大的工具。

In addition to the low-resolution kernel-managed jiffy mechanism, some CPU platforms feature a high-resolution counter that software can read. Although its actual use varies somewhat across platforms, it's sometimes a very powerful tool.

使用 jiffies 计数器

Using the jiffies Counter

计数器以及读取它的实用函数位于 <linux/jiffies.h> 中,尽管您通常只需包含 <linux/sched.h>,它会自动把 jiffies.h 拉进来。不用说,jiffies 和 jiffies_64 都必须被视为只读。

The counter and the utility functions to read it live in <linux/jiffies.h>, although you'll usually just include <linux/sched.h>, that automatically pulls jiffies.h in. Needless to say, both jiffies and jiffies_64 must be considered read-only.

每当您的代码需要记住 jiffies 的当前值时,只需访问这个 unsigned long 变量即可,它被声明为 volatile,以告诉编译器不要优化内存读取。每当您的代码需要计算未来的时间戳时,都需要读取当前计数器,如下例所示:

Whenever your code needs to remember the current value of jiffies, it can simply access the unsigned long variable, which is declared as volatile to tell the compiler not to optimize memory reads. You need to read the current counter whenever your code needs to calculate a future time stamp, as shown in the following example:

#include <linux/jiffies.h>
unsigned long j, stamp_1, stamp_half, stamp_n;

j = jiffies;                      /* read the current value */
stamp_1    = j + HZ;              /* 1 second in the future */
stamp_half = j + HZ/2;            /* half a second */
stamp_n    = j + n * HZ / 1000;   /* n milliseconds */

只要以正确的方式比较不同的值,这段代码就不会有 jiffies 回绕的问题。尽管在 32 位平台上,当 HZ 为 1000 时计数器每 50 天才回绕一次,您的代码仍应准备好应对该事件。要比较缓存的值(如上面的 stamp_1)和当前值,应该使用以下宏之一:

This code has no problem with jiffies wrapping around, as long as different values are compared in the right way. Even though on 32-bit platforms the counter wraps around only once every 50 days when HZ is 1000, your code should be prepared to face that event. To compare your cached value (like stamp_1 above) and the current value, you should use one of the following macros:

#include <linux/jiffies.h>
int time_after(unsigned long a, unsigned long b);
int time_before(unsigned long a, unsigned long b);
int time_after_eq(unsigned long a, unsigned long b);
int time_before_eq(unsigned long a, unsigned long b);

当 a 作为 jiffies 的快照,表示 b 之后的时间时,第一个宏的计算结果为 true;当时间 a 在时间 b 之前时,第二个宏的计算结果为 true;最后两个宏比较"之后或等于"和"之前或等于"。这些宏的工作原理是将值转换为有符号长整型,将它们相减,然后比较结果。如果您需要以安全的方式了解两个 jiffies 实例之间的差异,可以使用相同的技巧:diff = (long)t2 - (long)t1;。

The first evaluates true when a, as a snapshot of jiffies, represents a time after b, the second evaluates true when time a is before time b, and the last two compare for "after or equal" and "before or equal." The code works by converting the values to signed long, subtracting them, and comparing the result. If you need to know the difference between two instances of jiffies in a safe way, you can use the same trick: diff = (long)t2 - (long)t1;.

您可以通过以下方式将 jiffies 差异轻松转换为毫秒:

You can convert a jiffies difference to milliseconds trivially through:

msec = diff * 1000 / HZ;

然而,有时您需要与用户空间程序交换时间表示,这些程序往往用 struct timeval 和 struct timespec 表示时间。这两种结构用两个数字表示精确的时间量:旧的、流行的 struct timeval 使用秒和微秒,较新的 struct timespec 使用秒和纳秒。内核导出四个辅助函数,用于将以 jiffies 表示的时间值与这些结构相互转换:

Sometimes, however, you need to exchange time representations with user space programs that tend to represent time values with struct timeval and struct timespec. The two structures represent a precise time quantity with two numbers: seconds and microseconds are used in the older and popular struct timeval, and seconds and nanoseconds are used in the newer struct timespec. The kernel exports four helper functions to convert time values expressed as jiffies to and from those structures:

#include <linux/time.h>

unsigned long timespec_to_jiffies(struct timespec *value);
void jiffies_to_timespec(unsigned long jiffies, struct timespec *value);
unsigned long timeval_to_jiffies(struct timeval *value);
void jiffies_to_timeval(unsigned long jiffies, struct timeval *value);

访问 64 位 jiffy 计数并不像访问 jiffies 那样简单。虽然在 64 位计算机体系结构上这两个变量实际上是同一个,但在 32 位处理器上,对该值的访问不是原子的。这意味着如果在读取变量时变量的两半都被更新,您可能会读取到错误的值。您极不可能需要读取 64 位计数器,但如果确有需要,您会很高兴知道内核导出了一个专门的辅助函数,为您执行正确的锁定:

Accessing the 64-bit jiffy count is not as straightforward as accessing jiffies. While on 64-bit computer architectures the two variables are actually one, access to the value is not atomic for 32-bit processors. This means you might read the wrong value if both halves of the variable get updated while you are reading them. It's extremely unlikely you'll ever need to read the 64-bit counter, but in case you do, you'll be glad to know that the kernel exports a specific helper function that does the proper locking for you:

#include <linux/jiffies.h>
u64 get_jiffies_64(void);

在上面的原型中,使用了 u64 类型。这是 <linux/types.h> 定义的类型之一,表示无符号 64 位类型。

In the above prototype, the u64 type is used. This is one of the types defined by <linux/types.h> and represents an unsigned 64-bit type.

如果您想知道 32 位平台如何同时更新 32 位和 64 位计数器,请阅读您平台的链接器脚本(查找名称与 vmlinux*.lds* 匹配的文件)。在那里,根据平台是小端还是大端,jiffies 符号被定义为访问 64 位值的最低有效字。实际上,同样的技巧也适用于 64 位平台,因此 unsigned long 和 u64 变量在同一地址被访问。

If you're wondering how 32-bit platforms update both the 32-bit and 64-bit counters at the same time, read the linker script for your platform (look for a file whose name matches vmlinux*.lds*). There, the jiffies symbol is defined to access the least significant word of the 64-bit value, according to whether the platform is little-endian or big-endian. Actually, the same trick is used for 64-bit platforms, so that the unsigned long and u64 variables are accessed at the same address.

最后,请注意,实际时钟频率几乎完全对用户空间隐藏。当用户空间程序包含 param.h 时,宏 HZ 始终扩展为 100,并且报告给用户空间的每个计数器都会相应转换。这适用于 clock(3)、times(2) 和任何相关函数。HZ 值对用户可见的唯一证据是定时器中断发生的速度,如 /proc/interrupts 中所示。例如,您可以用此计数除以 /proc/uptime 中报告的系统正常运行时间来获得 HZ。

Finally, note that the actual clock frequency is almost completely hidden from user space. The macro HZ always expands to 100 when user-space programs include param.h, and every counter reported to user space is converted accordingly. This applies to clock(3), times(2), and any related function. The only evidence available to users of the HZ value is how fast timer interrupts happen, as shown in /proc/interrupts. For example, you can obtain HZ by dividing this count by the system uptime reported in /proc/uptime.

处理器特定寄存器

Processor-Specific Registers

如果您需要测量非常短的时间间隔,或者需要极高的测量精度,您可以求助于平台相关的资源,以可移植性换取精度。

If you need to measure very short time intervals or you need extremely high precision in your figures, you can resort to platform-dependent resources, a choice of precision over portability.

在现代处理器中,由于高速缓存、指令调度和分支预测,大多数 CPU 设计中指令时序固有的不可预测性,使得获取经验性能数据的迫切需求难以满足。作为回应,CPU 制造商引入了一种计算时钟周期的方法,作为测量时间流逝的简单可靠的手段。因此,大多数现代处理器都包含一个在每个时钟周期稳定递增一次的计数器寄存器。如今,该时钟计数器是执行高分辨率计时任务的唯一可靠方式。

In modern processors, the pressing demand for empirical performance figures is thwarted by the intrinsic unpredictability of instruction timing in most CPU designs due to cache memories, instruction scheduling, and branch prediction. As a response, CPU manufacturers introduced a way to count clock cycles as an easy and reliable way to measure time lapses. Therefore, most modern processors include a counter register that is steadily incremented once at each clock cycle. Nowadays, this clock counter is the only reliable way to carry out high-resolution timekeeping tasks.

细节因平台而异:寄存器可能从用户空间可读,也可能不可读,可能可写,也可能不可写,并且可能是 64 或 32 位宽。在最后一种情况下,您必须准备好处理溢出,就像我们处理 jiffy 计数器一样。您的平台甚至可能不存在该寄存器,或者如果 CPU 缺乏该功能并且您正在处理专用计算机,则硬件设计者可以在外部设备中实现该寄存器。

The details differ from platform to platform: the register may or may not be readable from user space, it may or may not be writable, and it may be 64 or 32 bits wide. In the last case, you must be prepared to handle overflows just like we did with the jiffy counter. The register may even not exist for your platform, or it can be implemented in an external device by the hardware designer, if the CPU lacks the feature and you are dealing with a special-purpose computer.

无论该寄存器是否可以清零,我们都强烈建议不要重置它,即使硬件允许也是如此。毕竟,在任何给定时间,您可能都不是该计数器的唯一用户;例如,在某些支持 SMP 的平台上,内核依赖这样的计数器在处理器之间保持同步。由于您总是可以测量两个值之间的差异,只要该差异不超过溢出时间,您就可以在不通过修改当前值来宣示寄存器独占所有权的情况下完成工作。

Whether or not the register can be zeroed, we strongly discourage resetting it, even when hardware permits. You might not, after all, be the only user of the counter at any given time; on some platforms supporting SMP, for example, the kernel depends on such a counter to be synchronized across processors. Since you can always measure differences between values, as long as that difference doesn't exceed the overflow time, you can get the work done without claiming exclusive ownership of the register by modifying its current value.

最著名的计数器寄存器是 TSC(时间戳计数器),它随 Pentium 引入 x86 处理器,并从此出现在所有后续 CPU 设计中,包括 x86_64 平台。它是一个计数 CPU 时钟周期的 64 位寄存器,可以从内核空间和用户空间读取。

The most renowned counter register is the TSC (timestamp counter), introduced in x86 processors with the Pentium and present in all CPU designs ever since—including the x86_64 platform. It is a 64-bit register that counts CPU clock cycles; it can be read from both kernel space and user space.

包含<asm/msr.h>(x86 特定标头,其名称代表“机器特定寄存器”)后,您可以使用以下宏之一:

After including <asm/msr.h> (an x86-specific header whose name stands for "machine-specific registers"), you can use one of these macros:

rdtsc(low32,high32);
rdtscl(low32);
rdtscll(var64);

第一个宏以原子方式将 64 位值读入两个 32 位变量;第二个("读取低半部分")将寄存器的低半部分读入一个 32 位变量,丢弃高半部分;最后一个将 64 位值读入一个 long long 变量,其名称由此而来。所有这些宏都将值存储到它们的参数中。

The first macro atomically reads the 64-bit value into two 32-bit variables; the next one ("read low half") reads the low half of the register into a 32-bit variable, discarding the high half; the last reads the 64-bit value into a long long variable, hence, the name. All of these macros store values into their arguments.

对于 TSC 的大多数常见用途来说,读取计数器的低半部分就足够了。1 GHz 的 CPU 每 4.2 秒才使其溢出一次,因此如果您要测量的时间间隔可靠地短于该时间,就无需处理多寄存器变量。然而,随着 CPU 频率逐渐提高以及计时要求的增加,您将来很可能需要更频繁地读取 64 位计数器。

Reading the low half of the counter is enough for most common uses of the TSC. A 1-GHz CPU overflows it only once every 4.2 seconds, so you won't need to deal with multiregister variables if the time lapse you are benchmarking reliably takes less time. However, as CPU frequencies rise over time and as timing requirements increase, you'll most likely need to read the 64-bit counter more often in the future.

作为仅使用寄存器的低半部分的示例,以下几行测量指令本身的执行情况:

As an example using only the low half of the register, the following lines measure the execution of the instruction itself:

unsigned long ini, end;
rdtscl(ini); rdtscl(end);
printk("time lapse: %li\n", end - ini);

其他一些平台提供类似的功能,内核标头提供了一个独立于体系结构的函数,您可以用它来代替 rdtsc。它称为 get_cycles,在 <asm/timex.h> 中定义(由 <linux/timex.h> 包含)。它的原型是:

Some of the other platforms offer similar functionality, and kernel headers offer an architecture-independent function that you can use instead of rdtsc. It is called get_cycles, defined in <asm/timex.h> (included by <linux/timex.h>). Its prototype is:

#include <linux/timex.h>
cycles_t get_cycles(void);

该函数为每个平台定义,在没有周期计数器寄存器的平台上它总是返回 0。cycles_t 类型是用于保存所读取值的适当无符号类型。

This function is defined for every platform, and it always returns 0 on the platforms that have no cycle-counter register. The cycles_t type is an appropriate unsigned type to hold the value read.

尽管有独立于体系结构的函数可用,我们还是想借此机会展示一个内联汇编代码的示例。为此,我们为 MIPS 处理器实现一个 rdtscl 函数,其工作方式与 x86 的版本相同。

Despite the availability of an architecture-independent function, we'd like to take the opportunity to show an example of inline assembly code. To this aim, we implement a rdtscl function for MIPS processors that works in the same way as the x86 one.

我们以 MIPS 为例,因为大多数 MIPS 处理器都将一个 32 位计数器作为其内部"协处理器 0"的寄存器 9。要访问这个只能从内核空间读取的寄存器,您可以定义以下宏,它执行一条"从协处理器 0 移动"的汇编指令:[1]

We base the example on MIPS because most MIPS processors feature a 32-bit counter as register 9 of their internal "coprocessor 0." To access the register, readable only from kernel space, you can define the following macro that executes a "move from coprocessor 0" assembly instruction:[1]

#define rdtscl(dest) \
   _ _asm_ _ _ _volatile_ _("mfc0 %0,$9; nop" : "=r" (dest))

有了这个宏,MIPS 处理器就可以执行之前为 x86 显示的相同代码。

With this macro in place, the MIPS processor can execute the same code shown earlier for the x86.

对于 gcc 内联汇编,通用寄存器的分配由编译器完成。刚刚显示的宏使用 %0 作为"参数 0"的占位符,稍后将其指定为"用作输出 (=) 的任何寄存器 (r)"。该宏还规定输出寄存器必须对应于 C 表达式 dest。内联汇编的语法非常强大,但也有些复杂,特别是对于每个寄存器的用途受限的体系结构(即 x86 系列)。该语法在 gcc 文档中有描述,通常可以在 info 文档树中找到。

With gcc inline assembly, the allocation of general-purpose registers is left to the compiler. The macro just shown uses %0 as a placeholder for "argument 0," which is later specified as "any register (r) used as output (=)." The macro also states that the output register must correspond to the C expression dest. The syntax for inline assembly is very powerful but somewhat complex, especially for architectures that have constraints on what each register can do (namely, the x86 family). The syntax is described in the gcc documentation, usually available in the info documentation tree.

本节中显示的简短 C 代码片段已在 K7 级 x86 处理器和 MIPS VR4181 上运行(使用刚刚描述的宏)。前者报告了 11 个时钟周期的时间流逝,而后者仅报告了 2 个时钟周期。这个数字很小是预料之中的,因为 RISC 处理器通常每个时钟周期执行一条指令。

The short C-code fragment shown in this section has been run on a K7-class x86 processor and a MIPS VR4181 (using the macro just described). The former reported a time lapse of 11 clock ticks and the latter just 2 clock ticks. The small figure was expected, since RISC processors usually execute one instruction per clock cycle.

关于时间戳计数器还有另一件事值得了解:它们不一定在 SMP 系统中的处理器之间同步。为了确保获得一致的值,您应该在查询计数器的代码中禁用抢占。

There is one other thing worth knowing about timestamp counters: they are not necessarily synchronized across processors in an SMP system. To be sure of getting a coherent value, you should disable preemption for code that is querying the counter.

了解当前时间

Knowing the Current Time

内核代码始终可以通过查看 jiffies 的值来检索当前时间的表示。通常,该值仅表示自上次启动以来的时间这一事实与驱动程序无关,因为驱动程序的生命周期仅限于系统正常运行时间。如前所示,驱动程序可以使用 jiffies 的当前值来计算事件之间的时间间隔(例如,在输入设备驱动程序中区分双击和单击,或计算超时)。简而言之,当您需要测量时间间隔时,查看 jiffies 几乎总是足够的。如果您需要对很短的时间间隔进行非常精确的测量,处理器特定寄存器可以帮上忙(尽管它们会带来严重的可移植性问题)。

Kernel code can always retrieve a representation of the current time by looking at the value of jiffies. Usually, the fact that the value represents only the time since the last boot is not relevant to the driver, because its life is limited to the system uptime. As shown, drivers can use the current value of jiffies to calculate time intervals across events (for example, to tell double-clicks from single-clicks in input device drivers or calculate timeouts). In short, looking at jiffies is almost always sufficient when you need to measure time intervals. If you need very precise measurements for short time lapses, processor-specific registers come to the rescue (although they bring in serious portability issues).

驱动程序不太可能需要知道以月、日和小时表示的挂钟时间;通常只有 cron 和 syslogd 等用户程序需要这些信息。处理现实世界的时间通常最好留给用户空间,C 库在那里提供了更好的支持;此外,此类代码通常与策略相关性太强,不适合放入内核。不过,确实有一个内核函数可以将挂钟时间转换为 jiffies 值:

It's quite unlikely that a driver will ever need to know the wall-clock time, expressed in months, days, and hours; the information is usually needed only by user programs such as cron and syslogd. Dealing with real-world time is usually best left to user space, where the C library offers better support; besides, such code is often too policy-related to belong in the kernel. There is a kernel function that turns a wall-clock time into a jiffies value, however:

#include <linux/time.h>
unsigned long mktime (unsigned int year, unsigned int mon,
                      unsigned int day, unsigned int hour,
                      unsigned int min, unsigned int sec);

重复一遍:在驱动程序中直接处理挂钟时间通常是正在实施策略的标志,因此应该受到质疑。

To repeat: dealing directly with wall-clock time in a driver is often a sign that policy is being implemented and should therefore be questioned.

虽然您不必处理人类可读的时间表示,但有时即使在内核空间中,您也需要处理绝对时间戳。为此,<linux/time.h> 导出了 do_gettimeofday 函数。调用时,它会用熟悉的秒和微秒值填充一个 struct timeval 指针,与 gettimeofday 系统调用中使用的相同。do_gettimeofday 的原型是:

While you won't have to deal with human-readable representations of the time, sometimes you need to deal with absolute timestamp even in kernel space. To this aim, <linux/time.h> exports the do_gettimeofday function. When called, it fills a struct timeval pointer—the same one used in the gettimeofday system call—with the familiar seconds and microseconds values. The prototype for do_gettimeofday is:

#include <linux/time.h>
void do_gettimeofday(struct timeval *tv);

内核源码指出 do_gettimeofday 具有"接近微秒的分辨率",因为它会询问计时硬件当前 jiffy 已经过去了多少。不过,精度因体系结构而异,因为它取决于实际使用的硬件机制。例如,某些 m68knommu 处理器、Sun3 系统和其他 m68k 系统无法提供优于 jiffy 的分辨率。另一方面,奔腾系统通过读取本章前面描述的时间戳计数器,可以提供非常快速且精确的亚滴答级测量。

The source states that do_gettimeofday has "near microsecond resolution," because it asks the timing hardware what fraction of the current jiffy has already elapsed. The precision varies from one architecture to another, however, since it depends on the actual hardware mechanisms in use. For example, some m68knommu processors, Sun3 systems, and other m68k systems cannot offer more than jiffy resolution. Pentium systems, on the other hand, offer very fast and precise subtick measures by reading the timestamp counter described earlier in this chapter.

当前时间也可以从 xtime 变量(一个 struct timespec 值)获得,尽管只有 jiffy 级粒度。不鼓励直接使用此变量,因为很难原子地访问这两个字段。因此,内核提供了实用函数 current_kernel_time:

The current time is also available (though with jiffy granularity) from the xtime variable, a struct timespec value. Direct use of this variable is discouraged because it is difficult to atomically access both the fields. Therefore, the kernel offers the utility function current_kernel_time:

#include <linux/time.h>
struct timespec current_kernel_time(void);

以各种方式检索当前时间的代码可以在 jit("just in time")模块中找到,该模块位于 O'Reilly FTP 站点提供的源文件中。jit 创建一个名为 /proc/currentime 的文件,读取该文件时会以 ASCII 格式返回以下内容:

Code for retrieving the current time in the various ways it is available within the jit ("just in time") module in the source files provided on O'Reilly's FTP site. jit creates a file called /proc/currentime, which returns the following items in ASCII when read:

  • 以十六进制数字表示的当前 jiffies 和 jiffies_64 值

  • The current jiffies and jiffies_64 values as hex numbers

  • do_gettimeofday返回的当前时间

  • The current time as returned by do_gettimeofday

  • current_kernel_time 返回的 timespec

  • The timespec returned by current_kernel_time

我们选择使用动态/proc文件来将样板代码保持在最低限度——仅仅为了返回一点文本信息而创建整个设备是不值得的。

We chose to use a dynamic /proc file to keep the boilerplate code to a minimum—it's not worth creating a whole device just to return a little textual information.

只要模块处于加载状态,该文件就会持续返回文本行;每个 read 系统调用收集并返回一组数据,为便于阅读,数据被组织为两行。每当您在少于一个定时器滴答的时间内读取多组数据时,您都会看到查询硬件的 do_gettimeofday 与仅在定时器滴答时才更新的其他值之间的差异。

The file returns text lines continuously as long as the module is loaded; each read system call collects and returns one set of data, organized in two lines for better readability. Whenever you read multiple data sets in less than a timer tick, you'll see the difference between do_gettimeofday, which queries the hardware, and the other values that are updated only when the timer ticks.

phon% head -8 /proc/currentime
0x00bdbc1f 0x0000000100bdbc1f 1062370899.630126
                              1062370899.629161488
0x00bdbc1f 0x0000000100bdbc1f 1062370899.630150
                              1062370899.629161488
0x00bdbc20 0x0000000100bdbc20 1062370899.630208
                              1062370899.630161336
0x00bdbc20 0x0000000100bdbc20 1062370899.630233
                              1062370899.630161336

在上面的屏幕截图中,有两件有趣的事情需要注意。首先,current_kernel_time 的值虽然以纳秒表示,但只有时钟滴答级的粒度;do_gettimeofday 总是报告稍晚的时间,但不会晚于下一个定时器滴答。其次,64 位 jiffies 计数器的高 32 位字的最低有效位被置位。发生这种情况是因为 INITIAL_JIFFIES(启动时用于初始化计数器的默认值)会强制低字在启动后几分钟内溢出,以帮助检测与该溢出相关的问题。计数器中的这一初始偏置没有影响,因为 jiffies 与挂钟时间无关。在 /proc/uptime 中,内核从该计数器提取正常运行时间,并在转换之前消除初始偏置。

In the screenshot above, there are two interesting things to note. First, the current_kernel_time value, though expressed in nanoseconds, has only clock-tick granularity; do_gettimeofday consistently reports a later time but not later than the next timer tick. Second, the 64-bit jiffies counter has the least-significant bit of the upper 32-bit word set. This happens because the default value for INITIAL_JIFFIES, used at boot time to initialize the counter, forces a low-word overflow a few minutes after boot time to help detect problems related to that very overflow. This initial bias in the counter has no effect, because jiffies is unrelated to wall-clock time. In /proc/uptime, where the kernel extracts the uptime from the counter, the initial bias is removed before conversion.

延迟执行

Delaying Execution

设备驱动程序经常需要将特定代码段的执行延迟一段时间,通常是为了让硬件完成某些任务。在本节中,我们将介绍多种实现延迟的技术。具体情况决定了哪种技术最适用;我们会逐一介绍,并指出每种技术的优缺点。

Device drivers often need to delay the execution of a particular piece of code for a period of time, usually to allow the hardware to accomplish some task. In this section we cover a number of different techniques for achieving delays. The circumstances of each situation determine which technique is best to use; we go over them all, and point out the advantages and disadvantages of each.

需要考虑的一件重要事情是,考虑到 HZ 在各个平台上的取值范围,您所需的延迟与时钟滴答相比如何。可靠地长于时钟滴答且不受其粗粒度影响的延迟可以利用系统时钟。非常短的延迟通常必须通过软件循环来实现。在这两种情况之间存在一个灰色地带。在本章中,我们用"长"延迟来指代多个 jiffy 的延迟,在某些平台上它可以低至几毫秒,但对 CPU 和内核来说仍然很长。

One important thing to consider is how the delay you need compares with the clock tick, considering the range of HZ across the various platforms. Delays that are reliably longer than the clock tick, and don't suffer from its coarse granularity, can make use of the system clock. Very short delays typically must be implemented with software loops. In between these two cases lies a gray area. In this chapter, we use the phrase "long" delay to refer to a multiple-jiffy delay, which can be as low as a few milliseconds on some platforms, but is still long as seen by the CPU and the kernel.

以下各节通过从各种直观但不适当的解决方案到正确的解决方案的较长路径来讨论不同的延迟。我们选择这条路径是因为它允许更深入地讨论与时序相关的内核问题。如果您渴望找到正确的代码,只需浏览本节即可。

The following sections talk about the different delays by taking a somewhat long path from various intuitive but inappropriate solutions to the right solution. We chose this path because it allows a more in-depth discussion of kernel issues related to timing. If you are eager to find the right code, just skim through the section.

长延迟

Long Delays

有时候,驱动程序需要将执行延迟相对较长的时间——超过一个时钟滴答。有几种方法可以实现这种延迟;我们从最简单的技术开始,然后介绍更高级的技术。

Occasionally a driver needs to delay execution for relatively long periods—more than one clock tick. There are a few ways of accomplishing this sort of delay; we start with the simplest technique, then proceed to the more advanced techniques.

忙等待

Busy waiting

如果您想将执行延迟若干个时钟滴答,并且允许值有一些松动,最简单(尽管不推荐)的实现是一个监视 jiffy 计数器的循环。忙等待的实现通常类似于以下代码,其中 j1 是延迟到期时 jiffies 的值:

If you want to delay execution by a multiple of the clock tick, allowing some slack in the value, the easiest (though not recommended) implementation is a loop that monitors the jiffy counter. The busy-waiting implementation usually looks like the following code, where j1 is the value of jiffies at the expiration of the delay:

while (time_before(jiffies, j1))
    cpu_relax();

对 cpu_relax 的调用以一种特定于体系结构的方式表明,您此刻并没有用处理器做太多事情。在许多系统上它什么也不做;在对称多线程("超线程")系统上,它可能会将核心让给另一个线程。无论如何,只要有可能,就绝对应该避免这种方法。我们在这里展示它,是因为有时您可能希望运行此代码,以便更好地理解其他代码的内部机制。

The call to cpu_relax invokes an architecture-specific way of saying that you're not doing much with the processor at the moment. On many systems it does nothing at all; on symmetric multithreaded ("hyperthreaded") systems, it may yield the core to the other thread. In any case, this approach should definitely be avoided whenever possible. We show it here because on occasion you might want to run this code to better understand the internals of other code.

那么让我们看看这段代码是如何工作的。该循环保证能正常工作,因为 jiffies 被内核标头声明为 volatile,因此只要有 C 代码访问它,就会从内存中重新获取。尽管在技术上是正确的(因为它按设计工作),这种忙循环会严重降低系统性能。如果您没有将内核配置为可抢占,该循环会在整个延迟期间完全锁住处理器;调度程序永远不会抢占运行在内核空间中的进程,在时间 j1 到达之前,计算机看起来完全死机。如果您运行的是可抢占内核,问题就不那么严重,因为除非代码持有锁,否则处理器的部分时间可以被回收用于其他用途。然而,在可抢占系统上,忙等待的代价仍然很高。

So let's look at how this code works. The loop is guaranteed to work because jiffies is declared as volatile by the kernel headers and, therefore, is fetched from memory any time some C code accesses it. Although technically correct (in that it works as designed), this busy loop severely degrades system performance. If you didn't configure your kernel for preemptive operation, the loop completely locks the processor for the duration of the delay; the scheduler never preempts a process that is running in kernel space, and the computer looks completely dead until time j1 is reached. The problem is less serious if you are running a preemptive kernel, because, unless the code is holding a lock, some of the processor's time can be recovered for other uses. Busy waits are still expensive on preemptive systems, however.

更糟糕的是,如果进入循环时中断恰好被禁用,jiffies 就不会更新,while 条件将永远为真。运行可抢占内核也无济于事,您将被迫按下大红色按钮。

Still worse, if interrupts happen to be disabled when you enter the loop, jiffies won't be updated, and the while condition remains true forever. Running a preemptive kernel won't help either, and you'll be forced to hit the big red button.

这种延迟代码的实现与后面的几种实现一样,可以在 jit 模块中找到。模块创建的 /proc/jit* 文件在您每读取一行文本时都会延迟整整一秒,并且每行保证为 20 字节。如果您想测试忙等待代码,可以读取 /proc/jitbusy,它返回的每一行都会忙循环一秒钟。

This implementation of delaying code is available, like the following ones, in the jit module. The /proc/jit* files created by the module delay a whole second each time you read a line of text, and lines are guaranteed to be 20 bytes each. If you want to test the busy-wait code, you can read /proc/jitbusy, which busy-loops for one second for each line it returns.

警告

Warning

确保一次最多从 /proc/jitbusy 读取一行(或几行)。用于注册 /proc 文件的简化内核机制会反复调用 read 方法来填充用户请求的数据缓冲区。因此,像 cat /proc/jitbusy 这样的命令,如果一次读取 4 KB,会使计算机冻结 205 秒。

Be sure to read, at most, one line (or a few lines) at a time from /proc/jitbusy. The simplified kernel mechanism to register /proc files invokes the read method over and over to fill the data buffer the user requested. Therefore, a command such as cat /proc/jitbusy, if it reads 4 KB at a time, freezes the computer for 205 seconds.

读取 /proc/jitbusy 的建议命令是 dd bs=20 < /proc/jitbusy,还可以选择指定块数。该文件返回的每个 20 字节行代表延迟之前和之后 jiffy 计数器的值。以下是在一台空闲计算机上的运行示例:

The suggested command to read /proc/jitbusy is dd bs=20 < /proc/jitbusy, optionally specifying the number of blocks as well. Each 20-byte line returned by the file represents the value the jiffy counter had before and after the delay. This is a sample run on an otherwise unloaded computer:

phon% dd bs=20 count=5 < /proc/jitbusy
  1686518   1687518
  1687519   1688519
  1688520   1689520
  1689520   1690520
  1690521   1691521

一切看起来都不错:延迟正好是一秒(1000 个 jiffies),并且下一个 read 系统调用在前一个结束后立即开始。但让我们看看在运行大量 CPU 密集型进程(且内核不可抢占)的系统上会发生什么:

All looks good: delays are exactly one second (1000 jiffies), and the next read system call starts immediately after the previous one is over. But let's see what happens on a system with a large number of CPU-intensive processes running (and nonpreemptive kernel):

phon% dd bs=20 count=5 < /proc/jitbusy
  1911226   1912226
  1913323   1914323
  1919529   1920529
  1925632   1926632
  1931835   1932835

在这里,每个 read 系统调用都恰好延迟一秒,但内核可能需要 5 秒以上才能再次调度 dd 进程,使其得以发出下一个系统调用。这在多任务系统中是预料之中的;CPU 时间在所有正在运行的进程之间共享,CPU 密集型进程的动态优先级会被降低。(调度策略的讨论超出了本书的范围。)

Here, each read system call delays exactly one second, but the kernel can take more than 5 seconds before scheduling the dd process so it can issue the next system call. That's expected in a multitasking system; CPU time is shared between all running processes, and a CPU-intensive process has its dynamic priority reduced. (A discussion of scheduling policies is outside the scope of this book.)

上面所示的负载测试是在运行 load50 示例程序时执行的。该程序分叉出许多什么也不做、但以 CPU 密集方式空转的进程。该程序是本书附带示例文件的一部分,默认分叉 50 个进程,也可以在命令行上指定数量。在本章以及本书的其他地方,负载下的测试都是在一台原本空闲的计算机上运行 load50 时执行的。

The test under load shown above has been performed while running the load50 sample program. This program forks a number of processes that do nothing, but do it in a CPU-intensive way. The program is part of the sample files accompanying this book, and forks 50 processes by default, although the number can be specified on the command line. In this chapter, and elsewhere in the book, the tests with a loaded system have been performed with load50 running in an otherwise idle computer.

如果在运行可抢占内核时重复该命令,您将发现空闲 CPU 上没有明显差异,并且在负载下出现以下行为:

If you repeat the command while running a preemptible kernel, you'll find no noticeable difference on an otherwise idle CPU and the following behavior under load:

phon% dd bs=20 count=5 < /proc/jitbusy
 14940680  14942777
 14942778  14945430
 14945431  14948491
 14948492  14951960
 14951961  14955840

此处,系统调用结束与下一个系统调用开始之间没有明显的延迟,但各个延迟远超一秒:在所示示例中长达 3.8 秒,并且随时间推移而增加。这些值表明该进程在延迟期间被中断,调度了其他进程。系统调用之间的间隙并不是该进程唯一的调度时机,因此在那里看不到特别的延迟。

Here, there is no significant delay between the end of a system call and the beginning of the next one, but the individual delays are far longer than one second: up to 3.8 seconds in the example shown and increasing over time. These values demonstrate that the process has been interrupted during its delay, scheduling other processes. The gap between system calls is not the only scheduling option for this process, so no special delay can be seen there.

让出处理器

Yielding the processor

正如我们所看到的,忙等待会给整个系统带来沉重的负担;我们希望找到更好的技术。首先想到的改变是在我们不需要 CPU 时显式释放它。这是通过调用 <linux/sched.h> 中声明的 schedule 函数来完成的:

As we have seen, busy waiting imposes a heavy load on the system as a whole; we would like to find a better technique. The first change that comes to mind is to explicitly release the CPU when we're not interested in it. This is accomplished by calling the schedule function, declared in <linux/sched.h>:

while (time_before(jiffies, j1)) {
    schedule(  );
}

可以通过读取 /proc/jitsched 来测试这个循环,就像上面读取 /proc/jitbusy 一样。然而,这仍然不是最优的。当前进程除了释放 CPU 外什么也不做,但仍保留在运行队列中。如果它是唯一可运行的进程,它实际上还是会运行(它调用调度程序,调度程序选中同一个进程,后者又调用调度程序……)。换句话说,机器的负载(正在运行的进程的平均数量)至少为 1,而空闲任务(0 号进程,由于历史原因也称为 swapper)永远不会运行。尽管这个问题看似无关紧要,但在计算机空闲时运行空闲任务可以减轻处理器的工作负载,降低其温度并延长其寿命;如果这台计算机恰好是您的笔记本电脑,还能延长电池的续航时间。此外,由于进程在延迟期间实际上在执行,它消耗的所有时间都要记在它的账上。

This loop can be tested by reading /proc/jitsched as we read /proc/jitbusy above. However, this still isn't optimal. The current process does nothing but release the CPU, but it remains in the run queue. If it is the only runnable process, it actually runs (it calls the scheduler, which selects the same process, which calls the scheduler, which . . . ). In other words, the load of the machine (the average number of running processes) is at least one, and the idle task (process number 0, also called swapper for historical reasons) never runs. Though this issue may seem irrelevant, running the idle task when the computer is idle relieves the processor's workload, decreasing its temperature and increasing its lifetime, as well as the duration of the batteries if the computer happens to be your laptop. Moreover, since the process is actually executing during the delay, it is accountable for all the time it consumes.

/proc/jitsched 的行为实际上类似于在可抢占内核下运行 /proc/jitbusy。以下是在一台空载系统上的运行示例:

The behavior of /proc/jitsched is actually similar to running /proc/jitbusy under a preemptive kernel. This is a sample run, on an unloaded system:

phon% dd bs=20 count=5 < /proc/jitsched
  1760205   1761207
  1761209   1762211
  1762212   1763212
  1763213   1764213
  1764214   1765217

有趣的是,每次读取有时会比请求的时间多等待几个时钟滴答。随着系统越来越繁忙,这个问题会越来越严重,驱动程序最终的等待时间可能比预期的更长。一旦进程通过 schedule 释放了处理器,就无法保证它能很快重新获得处理器。因此,以这种方式调用 schedule 不仅对整个计算系统不利,对驱动程序的需求而言也不是一个安全的解决方案。如果您在运行 load50 时测试 jitsched,您会看到与每一行相关的延迟被延长了几秒,因为超时到期时其他进程正在使用 CPU。

It's interesting to note that each read sometimes ends up waiting a few clock ticks more than requested. This problem gets worse and worse as the system gets busy, and the driver could end up waiting longer than expected. Once a process releases the processor with schedule, there are no guarantees that the process will get the processor back anytime soon. Therefore, calling schedule in this manner is not a safe solution to the driver's needs, in addition to being bad for the computing system as a whole. If you test jitsched while running load50, you can see that the delay associated to each line is extended by a few seconds, because other processes are using the CPU when the timeout expires.

Timeouts

The suboptimal delay loops shown up to now work by watching the jiffy counter without telling anyone. But the best way to implement a delay, as you may imagine, is usually to ask the kernel to do it for you. There are two ways of setting up jiffy-based timeouts, depending on whether your driver is waiting for other events or not.

If your driver uses a wait queue to wait for some other event, but you also want to be sure that it runs within a certain period of time, it can use wait_event_timeout or wait_event_interruptible_timeout:

#include <linux/wait.h>
long wait_event_timeout(wait_queue_head_t q, condition, long timeout);
long wait_event_interruptible_timeout(wait_queue_head_t q,
                      condition, long timeout);

These functions sleep on the given wait queue, but they return after the timeout (expressed in jiffies) expires. Thus, they implement a bounded sleep that does not go on forever. Note that the timeout value represents the number of jiffies to wait, not an absolute time value. The value is represented by a signed number, because it sometimes is the result of a subtraction, although the functions complain through a printk statement if the provided timeout is negative. If the timeout expires, the functions return 0; if the process is awakened by another event, it returns the remaining delay expressed in jiffies. The return value is never negative, even if the delay is greater than expected because of system load.

The /proc/jitqueue file shows a delay based on wait_event_interruptible_timeout, although the module has no event to wait for, and uses 0 as a condition:

wait_queue_head_t wait;
init_waitqueue_head (&wait);
wait_event_interruptible_timeout(wait, 0, delay);

The observed behavior, when reading /proc/jitqueue, is nearly optimal, even under load:

phon% dd bs=20 count=5 < /proc/jitqueue
  2027024   2028024
  2028025   2029025
  2029026   2030026
  2030027   2031027
  2031028   2032028

Since the reading process (dd above) is not in the run queue while waiting for the timeout, you see no difference in behavior whether the code is run in a preemptive kernel or not.

wait_event_timeoutwait_event_interruptible_timeout在设计时考虑了硬件驱动程序,可以通过两种方式之一恢复执行:有人在等待队列上调用wake_up,或者超时到期。这不适用于jitqueue,因为没有人在等待队列上调用 wake_up(毕竟,没有其他代码知道它),因此当超时到期时,进程总是会被唤醒。为了适应这种情况,即您希望延迟执行而不等待特定事件,内核提供了schedule_timeout函数,这样您就可以避免声明和使用多余的等待队列头:

wait_event_timeout and wait_event_interruptible_timeout were designed with a hardware driver in mind, where execution could be resumed in either of two ways: either somebody calls wake_up on the wait queue, or the timeout expires. This doesn't apply to jitqueue, as nobody ever calls wake_up on the wait queue (after all, no other code even knows about it), so the process always wakes up when the timeout expires. To accommodate this situation, where you want to delay execution waiting for no specific event, the kernel offers the schedule_timeout function so you can avoid declaring and using a superfluous wait queue head:

#include <linux/sched.h>
signed long schedule_timeout(signed long timeout);

Here, timeout is the number of jiffies to delay. The return value is 0 unless the function returns before the given timeout has elapsed (in response to a signal). schedule_timeout requires that the caller first set the current process state, so a typical call looks like:

set_current_state(TASK_INTERRUPTIBLE);
schedule_timeout (delay);

The previous lines (from /proc/jitschedto ) cause the process to sleep until the given time has passed. Since wait_event_interruptible_timeout relies on schedule_timeout internally, we won't bother showing the numbers jitschedto returns, because they are the same as those of jitqueue. Once again, it is worth noting that an extra time interval could pass between the expiration of the timeout and when your process is actually scheduled to execute.

In the example just shown, the first line calls set_current_state to set things up so that the scheduler won't run the current process again until the timeout places it back in TASK_RUNNING state. To achieve an uninterruptible delay, use TASK_UNINTERRUPTIBLE instead. If you forget to change the state of the current process, a call to schedule_timeout behaves like a call to schedule (i.e., the jitsched behavior), setting up a timer that is not used.

If you want to play with the four jit files under different system situations or different kernels, or try other ways to delay execution, you may want to configure the amount of the delay when loading the module by setting the delay module parameter.

Short Delays

When a device driver needs to deal with latencies in its hardware, the delays involved are usually a few dozen microseconds at most. In this case, relying on the clock tick is definitely not the way to go.

The kernel functions ndelay, udelay, and mdelay serve well for short delays, delaying execution for the specified number of nanoseconds, microseconds, or milliseconds respectively.[2] Their prototypes are:

#include <linux/delay.h>
void ndelay(unsigned long nsecs);
void udelay(unsigned long usecs);
void mdelay(unsigned long msecs);

The actual implementations of the functions are in <asm/delay.h>; they are architecture-specific and sometimes build on an external function. Every architecture implements udelay, but the other functions may or may not be defined; if they are not, <linux/delay.h> offers a default version based on udelay. In all cases, the delay achieved is at least the requested value but could be more; in fact, no platform currently achieves nanosecond precision, although several offer submicrosecond precision. Delaying more than the requested value is usually not a problem, as short delays in a driver are usually needed to wait for the hardware, and the requirement is to wait for at least a given time lapse.

The implementation of udelay (and possibly ndelay too) uses a software loop based on the processor speed calculated at boot time, using the integer variable loops_per_jiffy. If you want to look at the actual code, however, be aware that the x86 implementation is quite a complex one because of the different timing sources it uses, based on what CPU type is running the code.

To avoid integer overflows in loop calculations, udelay and ndelay impose an upper bound on the value passed to them. If your module fails to load and displays an unresolved symbol, __bad_udelay, it means you called udelay with too large an argument. Note, however, that the compile-time check can be performed only on constant values and that not all platforms implement it. As a general rule, if you are trying to delay for thousands of nanoseconds, you should be using udelay rather than ndelay; similarly, millisecond-scale delays should be done with mdelay and not one of the finer-grained functions.

It's important to remember that the three delay functions are busy-waiting; other tasks can't be run during the time lapse. Thus, they replicate, though on a different scale, the behavior of jitbusy, and they should only be used when there is no practical alternative.

There is another way of achieving millisecond (and longer) delays that does not involve busy waiting. The file <linux/delay.h> declares these functions:

void msleep(unsigned int millisecs);
unsigned long msleep_interruptible(unsigned int millisecs);
void ssleep(unsigned int seconds);

The first two functions put the calling process to sleep for the given number of millisecs. A call to msleep is uninterruptible; you can be sure that the process sleeps for at least the given number of milliseconds. If your driver is sitting on a wait queue and you want a wakeup to break the sleep, use msleep_interruptible. The return value from msleep_interruptible is normally 0; if, however, the process is awakened early, the return value is the number of milliseconds remaining in the originally requested sleep period. A call to ssleep puts the process into an uninterruptible sleep for the given number of seconds.

In general, if you can tolerate longer delays than requested, you should use schedule_timeout, msleep, or ssleep.

Kernel Timers

Whenever you need to schedule an action to happen later, without blocking the current process until that time arrives, kernel timers are the tool for you. These timers are used to schedule execution of a function at a particular time in the future, based on the clock tick, and can be used for a variety of tasks; for example, polling a device by checking its state at regular intervals when the hardware can't fire interrupts. Other typical uses of kernel timers are turning off the floppy motor or finishing another lengthy shut down operation. In such cases, delaying the return from close would impose an unnecessary (and surprising) cost on the application program. Finally, the kernel itself uses the timers in several situations, including the implementation of schedule_timeout.

A kernel timer is a data structure that instructs the kernel to execute a user-defined function with a user-defined argument at a user-defined time. The implementation resides in <linux/timer.h> and kernel/timer.c and is described in detail in Section 7.4.2.

The functions scheduled to run almost certainly do not run while the process that registered them is executing. They are, instead, run asynchronously. Until now, everything we have done in our sample drivers has run in the context of a process executing system calls. When a timer runs, however, the process that scheduled it could be asleep, executing on a different processor, or quite possibly has exited altogether.

This asynchronous execution resembles what happens when a hardware interrupt happens (which is discussed in detail in Chapter 10). In fact, kernel timers are run as the result of a "software interrupt." When running in this sort of atomic context, your code is subject to a number of constraints. Timer functions must be atomic in all the ways we discussed in Chapter 5, but there are some additional issues brought about by the lack of a process context. We will introduce these constraints now; they will be seen again in several places in later chapters. Repetition is called for because the rules for atomic contexts must be followed assiduously, or the system will find itself in deep trouble.

A number of actions require the context of a process in order to be executed. When you are outside of process context (i.e., in interrupt context), you must observe the following rules:

  • No access to user space is allowed. Because there is no process context, there is no path to the user space associated with any particular process.

  • The current pointer is not meaningful in atomic mode and cannot be used since the relevant code has no connection with the process that has been interrupted.

  • No sleeping or scheduling may be performed. Atomic code may not call schedule or a form of wait_event, nor may it call any other function that could sleep. For example, calling kmalloc(..., GFP_KERNEL) is against the rules. Semaphores also must not be used since they can sleep.

Kernel code can tell if it is running in interrupt context by calling the function in_interrupt(), which takes no parameters and returns nonzero if the processor is currently running in interrupt context, either hardware interrupt or software interrupt.

A function related to in_interrupt() is in_atomic(). Its return value is nonzero whenever scheduling is not allowed; this includes hardware and software interrupt contexts as well as any time when a spinlock is held. In the latter case, current may be valid, but access to user space is forbidden, since it can cause scheduling to happen. Whenever you are using in_interrupt(), you should really consider whether in_atomic() is what you actually mean. Both functions are declared in <asm/hardirq.h>.

One other important feature of kernel timers is that a task can reregister itself to run again at a later time. This is possible because each timer_list structure is unlinked from the list of active timers before being run and can, therefore, be immediately re-linked elsewhere. Although rescheduling the same task over and over might appear to be a pointless operation, it is sometimes useful. For example, it can be used to implement the polling of devices.

It's also worth knowing that in an SMP system, the timer function is executed by the same CPU that registered it, to achieve better cache locality whenever possible. Therefore, a timer that reregisters itself always runs on the same CPU.

An important feature of timers that should not be forgotten, though, is that they are a potential source of race conditions, even on uniprocessor systems. This is a direct result of their being asynchronous with other code. Therefore, any data structures accessed by the timer function should be protected from concurrent access, either by being atomic types (discussed in the section Atomic Variables) or by using spinlocks (discussed in Chapter 5).

The Timer API

The kernel provides drivers with a number of functions to declare, register, and remove kernel timers. The following excerpt shows the basic building blocks:

#include <linux/timer.h>
struct timer_list {
        /* ... */
        unsigned long expires;
        void (*function)(unsigned long);
        unsigned long data;
};

void init_timer(struct timer_list *timer);
struct timer_list TIMER_INITIALIZER(_function, _expires, _data);

void add_timer(struct timer_list * timer);
int del_timer(struct timer_list * timer);

The data structure includes more fields than the ones shown, but those three are the ones that are meant to be accessed from outside the timer code itself. The expires field represents the jiffies value when the timer is expected to run; at that time, the function function is called with data as an argument. If you need to pass multiple items in the argument, you can bundle them as a single data structure and pass a pointer cast to unsigned long, a safe practice on all supported architectures and pretty common in memory management (as discussed in Chapter 15). The expires value is not a jiffies_64 item because timers are not expected to expire very far in the future, and 64-bit operations are slow on 32-bit platforms.

The structure must be initialized before use. This step ensures that all the fields are properly set up, including the ones that are opaque to the caller. Initialization can be performed by calling init_timer or assigning TIMER_INITIALIZER to a static structure, according to your needs. After initialization, you can change the three public fields before calling add_timer. To disable a registered timer before it expires, call del_timer.

The jit module includes a sample file, /proc/jitimer (for "just in timer"), that returns one header line and six data lines. The data lines represent the current environment where the code is running; the first one is generated by the read file operation and the others by a timer. The following output was recorded while compiling a kernel:

phon% cat /proc/jitimer
   time   delta  inirq    pid   cpu command
 33565837    0     0      1269   0   cat
 33565847   10     1      1271   0   sh
 33565857   10     1      1273   0   cpp0
 33565867   10     1      1273   0   cpp0
 33565877   10     1      1274   0   cc1
 33565887   10     1      1274   0   cc1

In this output, the time field is the value of jiffies when the code runs, delta is the change in jiffies since the previous line, inirq is the Boolean value returned by in_interrupt, pid and command refer to the current process, and cpu is the number of the CPU being used (always 0 on uniprocessor systems).

If you read /proc/jitimer while the system is unloaded, you'll find that the context of the timer is process 0, the idle task, which is called "swapper" mainly for historical reasons.

The timer used to generate /proc/jitimer data is run every 10 jiffies by default, but you can change the value by setting the tdelay (timer delay) parameter when loading the module.

The following code excerpt shows the part of jit related to the jitimer timer. When a process attempts to read our file, we set up the timer as follows:

unsigned long j = jiffies;

/* fill the data for our timer function */
data->prevjiffies = j;
data->buf = buf2;
data->loops = JIT_ASYNC_LOOPS;
    
/* register the timer */
data->timer.data = (unsigned long)data;
data->timer.function = jit_timer_fn;
data->timer.expires = j + tdelay; /* parameter */
add_timer(&data->timer);

/* wait for the buffer to fill */
wait_event_interruptible(data->wait, !data->loops);

The actual timer function looks like this:

void jit_timer_fn(unsigned long arg)
{
    struct jit_data *data = (struct jit_data *)arg;
    unsigned long j = jiffies;
    data->buf += sprintf(data->buf, "%9li  %3li     %i    %6i   %i   %s\n",
                 j, j - data->prevjiffies, in_interrupt(  ) ? 1 : 0,
                 current->pid, smp_processor_id(  ), current->comm);

    if (--data->loops) {
        data->timer.expires += tdelay;
        data->prevjiffies = j;
        add_timer(&data->timer);
    } else {
        wake_up_interruptible(&data->wait);
    }
}

The timer API includes a few more functions than the ones introduced above. The following set completes the list of kernel offerings:

int mod_timer(struct timer_list *timer, unsigned long expires);

Updates the expiration time of a timer, a common task for which a timeout timer is used (again, the motor-off floppy timer is a typical example). mod_timer can be called on inactive timers as well, where you normally use add_timer.

int del_timer_sync(struct timer_list *timer);

Works like del_timer, but also guarantees that when it returns, the timer function is not running on any CPU. del_timer_sync is used to avoid race conditions on SMP systems and is the same as del_timer in UP kernels. This function should be preferred over del_timer in most situations. This function can sleep if it is called from a nonatomic context but busy waits in other situations. Be very careful about calling del_timer_sync while holding locks; if the timer function attempts to obtain the same lock, the system can deadlock. If the timer function reregisters itself, the caller must first ensure that this reregistration will not happen; this is usually accomplished by setting a "shutting down" flag, which is checked by the timer function.

int timer_pending(const struct timer_list * timer);

Returns true or false to indicate whether the timer is currently scheduled to run by reading one of the opaque fields of the structure.

The Implementation of Kernel Timers

Although you won't need to know how kernel timers are implemented in order to use them, the implementation is interesting, and a look at its internals is worthwhile.

The implementation of the timers has been designed to meet the following requirements and assumptions:

  • Timer management must be as lightweight as possible.

  • The design should scale well as the number of active timers increases.

  • Most timers expire within a few seconds or minutes at most, while timers with long delays are pretty rare.

  • A timer should run on the same CPU that registered it.

The solution devised by kernel developers is based on a per-CPU data structure. The timer_list structure includes a pointer to that data structure in its base field. If base is NULL, the timer is not scheduled to run; otherwise, the pointer tells which data structure (and, therefore, which CPU) runs it. Per-CPU data items are described in Section 8.5.

Whenever kernel code registers a timer (via add_timer or mod_timer), the operation is eventually performed by internal_add_timer (in kernel/timer.c) which, in turn, adds the new timer to a double-linked list of timers within a "cascading table" associated with the current CPU.

The cascading table works like this: if the timer expires within the next 255 jiffies, it is added to one of the 256 lists devoted to short-range timers using the least significant bits of the expires field. If it expires farther in the future (but before 16,384 jiffies), it is added to one of 64 lists based on bits 9-14 of the expires field. For timers expiring even farther, the same trick is used for bits 15-20, 21-26, and 27-31. Timers with an expires field pointing still farther in the future (something that can happen only on 64-bit platforms) are hashed with a delay value of 0xffffffff, and timers with expires in the past are scheduled to run at the next timer tick. (A timer that is already expired may sometimes be registered in high-load situations, especially if you run a preemptible kernel.)

__run_timers被触发时,它会执行当前计时器滴答的所有待处理计时器。如果jiffies当前是 256 的倍数,则该函数还将下一级计时器列表之一重新散列到 256 个短期列表中,根据 的位表示,还可能级联一个或多个其他级别jiffies

When __run_timers is fired, it executes all pending timers for the current timer tick. If jiffies is currently a multiple of 256, the function also rehashes one of the next-level lists of timers into the 256 short-term lists, possibly cascading one or more of the other levels as well, according to the bit representation of jiffies.

This approach, while exceedingly complex at first sight, performs very well both with few timers and with a large number of them. The time required to manage each active timer is independent of the number of timers already registered and is limited to a few logic operations on the binary representation of its expires field. The only cost associated with this implementation is the memory for the 512 list heads (256 short-term lists and 4 groups of 64 more lists)—i.e., 4 KB of storage.

The function __run_timers, as shown by /proc/jitimer, is run in atomic context. In addition to the limitations we already described, this brings in an interesting feature: the timer expires at just the right time, even if you are not running a preemptible kernel, and the CPU is busy in kernel space. You can see what happens when you read /proc/jitbusy in the background and /proc/jitimer in the foreground. Although the system appears to be locked solid by the busy-waiting system call, the kernel timers still work fine.

Keep in mind, however, that a kernel timer is far from perfect, as it suffers from jitter and other artifacts induced by hardware interrupts, as well as other timers and other asynchronous tasks. While a timer associated with simple digital I/O can be enough for simple tasks like running a stepper motor or other amateur electronics, it is usually not suitable for production systems in industrial environments. For such tasks, you'll most likely need to resort to a real-time kernel extension.

Tasklets

Another kernel facility related to timing issues is the tasklet mechanism. It is mostly used in interrupt management (we'll see it again in Chapter 10.)

Tasklets resemble kernel timers in some ways. They are always run at interrupt time, they always run on the same CPU that schedules them, and they receive an unsigned long argument. Unlike kernel timers, however, you can't ask to execute the function at a specific time. By scheduling a tasklet, you simply ask for it to be executed at a later time chosen by the kernel. This behavior is especially useful with interrupt handlers, where the hardware interrupt must be managed as quickly as possible, but most of the data management can be safely delayed to a later time. Actually, a tasklet, just like a kernel timer, is executed (in atomic mode) in the context of a "soft interrupt," a kernel mechanism that executes asynchronous tasks with hardware interrupts enabled.

A tasklet exists as a data structure that must be initialized before use. Initialization can be performed by calling a specific function or by declaring the structure using certain macros:

#include <linux/interrupt.h>

struct tasklet_struct {
      /* ... */
      void (*func)(unsigned long);
      unsigned long data;
};

void tasklet_init(struct tasklet_struct *t,
      void (*func)(unsigned long), unsigned long data);
DECLARE_TASKLET(name, func, data);
DECLARE_TASKLET_DISABLED(name, func, data);

Tasklets offer a number of interesting features:

  • A tasklet can be disabled and re-enabled later; it won't be executed until it is enabled as many times as it has been disabled.

  • Just like timers, a tasklet can reregister itself.

  • A tasklet can be scheduled to execute at normal priority or high priority. The latter group is always executed first.

  • Tasklets may be run immediately if the system is not under heavy load but never later than the next timer tick.

  • A tasklet can run concurrently with other tasklets but is strictly serialized with respect to itself—the same tasklet never runs simultaneously on more than one processor. Also, as already noted, a tasklet always runs on the same CPU that schedules it.

The jit module includes two files, /proc/jitasklet and /proc/jitasklethi, that return the same data as /proc/jitimer, introduced in Section 7.4. When you read one of the files, you get back a header and six data lines. The first data line describes the context of the calling process, and the other lines describe the context of successive runs of a tasklet procedure. This is a sample run taken while compiling a kernel:

phon% cat /proc/jitasklet
   time   delta  inirq    pid   cpu command
  6076139    0     0      4370   0   cat
  6076140    1     1      4368   0   cc1
  6076141    1     1      4368   0   cc1
  6076141    0     1         2   0   ksoftirqd/0
  6076141    0     1         2   0   ksoftirqd/0
  6076141    0     1         2   0   ksoftirqd/0

As confirmed by the above data, the tasklet is run at the next timer tick as long as the CPU is busy running a process, but it is run immediately when the CPU is otherwise idle. The kernel provides a set of ksoftirqd kernel threads, one per CPU, just to run "soft interrupt" handlers, such as the tasklet_action function. Thus, the final three runs of the tasklet take place in the context of the ksoftirqd kernel thread associated with CPU 0. The jitasklethi implementation uses a high-priority tasklet, as explained in the upcoming list of functions.

The actual code in jit that implements /proc/jitasklet and /proc/jitasklethi is almost identical to the code that implements /proc/jitimer, but it uses the tasklet calls instead of the timer ones. The following list lays out in detail the kernel interface to tasklets after the tasklet structure has been initialized:

void tasklet_disable(struct tasklet_struct *t);

This function disables the given tasklet. The tasklet may still be scheduled with tasklet_schedule, but its execution is deferred until the tasklet has been enabled again. If the tasklet is currently running, this function busy-waits until the tasklet exits; thus, after calling tasklet_disable, you can be sure that the tasklet is not running anywhere in the system.

void tasklet_disable_nosync(struct tasklet_struct *t);

Disables the tasklet, but without waiting for any currently running invocation to exit. When it returns, the tasklet is disabled and won't be scheduled in the future until re-enabled, but it may still be running on another CPU when the function returns.

void tasklet_enable(struct tasklet_struct *t);

Enables a tasklet that had been previously disabled. If the tasklet has already been scheduled, it will run soon. A call to tasklet_enable must match each call to tasklet_disable, as the kernel keeps track of the "disable count" for each tasklet.

void tasklet_schedule(struct tasklet_struct *t);

Schedule the tasklet for execution. If a tasklet is scheduled again before it has a chance to run, it runs only once. However, if it is scheduled while it runs, it runs again after it completes; this ensures that events occurring while other events are being processed receive due attention. This behavior also allows a tasklet to reschedule itself.

void tasklet_hi_schedule(struct tasklet_struct *t);

Schedule the tasklet for execution with higher priority. When the soft interrupt handler runs, it deals with high-priority tasklets before other soft interrupt tasks, including "normal" tasklets. Ideally, only tasks with low-latency requirements (such as filling the audio buffer) should use this function, to avoid the additional latencies introduced by other soft interrupt handlers. Actually, /proc/jitasklethi shows no human-visible difference from /proc/jitasklet.

void tasklet_kill(struct tasklet_struct *t);

This function ensures that the tasklet is not scheduled to run again; it is usually called when a device is being closed or the module removed. If the tasklet is scheduled to run, the function waits until it has executed. If the tasklet reschedules itself, you must prevent it from rescheduling itself before calling tasklet_kill, as with del_timer_sync.

Tasklets are implemented in kernel/softirq.c. The two tasklet lists (normal and high-priority) are declared as per-CPU data structures, using the same CPU-affinity mechanism used by kernel timers. The data structure used in tasklet management is a simple linked list, because tasklets have none of the sorting requirements of kernel timers.

Workqueues

Workqueues are, superficially, similar to tasklets; they allow kernel code to request that a function be called at some future time. There are, however, some significant differences between the two, including:

  • Tasklets run in software interrupt context with the result that all tasklet code must be atomic. Instead, workqueue functions run in the context of a special kernel process; as a result, they have more flexibility. In particular, workqueue functions can sleep.

  • Tasklets always run on the processor from which they were originally submitted. Workqueues work in the same way, by default.

  • Kernel code can request that the execution of workqueue functions be delayed for an explicit interval.

The key difference between the two is that tasklets execute quickly, for a short period of time, and in atomic mode, while workqueue functions may have higher latency but need not be atomic. Each mechanism has situations where it is appropriate.

Workqueues have a type of struct workqueue_struct, which is defined in <linux/workqueue.h>. A workqueue must be explicitly created before use, using one of the following two functions:

struct workqueue_struct *create_workqueue(const char *name);
struct workqueue_struct *create_singlethread_workqueue(const char *name);

Each workqueue has one or more dedicated processes ("kernel threads"), which run functions submitted to the queue. If you use create_workqueue, you get a workqueue that has a dedicated thread for each processor on the system. In many cases, all those threads are simply overkill; if a single worker thread will suffice, create the workqueue with create_singlethread_workqueue instead.

To submit a task to a workqueue, you need to fill in a work_struct structure. This can be done at compile time as follows:

DECLARE_WORK(name, void (*function)(void *), void *data);

Where name is the name of the structure to be declared, function is the function that is to be called from the workqueue, and data is a value to pass to that function. If you need to set up the work_struct structure at runtime, use the following two macros:

INIT_WORK(struct work_struct *work, void (*function)(void *), void *data);
PREPARE_WORK(struct work_struct *work, void (*function)(void *), void *data);

INIT_WORK does a more thorough job of initializing the structure; you should use it the first time that structure is set up. PREPARE_WORK does almost the same job, but it does not initialize the pointers used to link the work_struct structure into the workqueue. If there is any possibility that the structure may currently be submitted to a workqueue, and you need to change that structure, use PREPARE_WORK rather than INIT_WORK.

There are two functions for submitting work to a workqueue:

int queue_work(struct workqueue_struct *queue, struct work_struct *work);
int queue_delayed_work(struct workqueue_struct *queue, 
                       struct work_struct *work, unsigned long delay);

Either one adds work to the given queue. If queue_delayed_work is used, however, the actual work is not performed until at least delay jiffies have passed. The return value from these functions is 0 if the work was successfully added to the queue; a nonzero result means that this work_struct structure was already waiting in the queue, and was not added a second time.

At some time in the future, the work function will be called with the given data value. The function will be running in the context of the worker thread, so it can sleep if need be—although you should be aware of how that sleep might affect any other tasks submitted to the same workqueue. What the function cannot do, however, is access user space. Since it is running inside a kernel thread, there simply is no user space to access.

Should you need to cancel a pending workqueue entry, you may call:

int cancel_delayed_work(struct work_struct *work);

The return value is nonzero if the entry was canceled before it began execution. The kernel guarantees that execution of the given entry will not be initiated after a call to cancel_delayed_work. If cancel_delayed_work returns 0, however, the entry may have already been running on a different processor, and might still be running after a call to cancel_delayed_work. To be absolutely sure that the work function is not running anywhere in the system after cancel_delayed_work returns 0, you must follow that call with a call to:

void flush_workqueue(struct workqueue_struct *queue);

After flush_workqueue returns, no work function submitted prior to the call is running anywhere in the system.

When you are done with a workqueue, you can get rid of it with:

void destroy_workqueue(struct workqueue_struct *queue);

The Shared Queue

A device driver, in many cases, does not need its own workqueue. If you only submit tasks to the queue occasionally, it may be more efficient to simply use the shared, default workqueue that is provided by the kernel. If you use this queue, however, you must be aware that you will be sharing it with others. Among other things, that means that you should not monopolize the queue for long periods of time (no long sleeps), and it may take longer for your tasks to get their turn in the processor.

The jiq ("just in queue") module exports two files that demonstrate the use of the shared workqueue. They use a single work_struct structure, which is set up this way:

static struct work_struct jiq_work;

    /* this line is in jiq_init(  ) */
    INIT_WORK(&jiq_work, jiq_print_wq, &jiq_data);

When a process reads /proc/jiqwq, the module initiates a series of trips through the shared workqueue with no delay. The function it uses is:

int schedule_work(struct work_struct *work);

Note that a different function is used when working with the shared queue; it requires only the work_struct structure for an argument. The actual code in jiq looks like this:

prepare_to_wait(&jiq_wait, &wait, TASK_INTERRUPTIBLE);
schedule_work(&jiq_work);
schedule(  );
finish_wait(&jiq_wait, &wait);

The actual work function prints out a line just like the jit module does, then, if need be, resubmits the work_struct structure into the workqueue. Here is jiq_print_wq in its entirety:

static void jiq_print_wq(void *ptr)
{
    struct clientdata *data = (struct clientdata *) ptr;
    
    if (! jiq_print (ptr))
        return;
    
    if (data->delay)
        schedule_delayed_work(&jiq_work, data->delay);
    else
        schedule_work(&jiq_work);
}

If the user is reading the delayed device (/proc/jiqwqdelay), the work function resubmits itself in the delayed mode with schedule_delayed_work:

int schedule_delayed_work(struct work_struct *work, unsigned long delay);

If you look at the output from these two devices, it looks something like:

% cat /proc/jiqwq
    time  delta preempt   pid cpu command
  1113043     0       0     7   1 events/1
  1113043     0       0     7   1 events/1
  1113043     0       0     7   1 events/1
  1113043     0       0     7   1 events/1
  1113043     0       0     7   1 events/1
% cat /proc/jiqwqdelay
    time  delta preempt   pid cpu command
  1122066     1       0     6   0 events/0
  1122067     1       0     6   0 events/0
  1122068     1       0     6   0 events/0
  1122069     1       0     6   0 events/0
  1122070     1       0     6   0 events/0

When /proc/jiqwq is read, there is no obvious delay between the printing of each line. When, instead, /proc/jiqwqdelay is read, there is a delay of exactly one jiffy between each line. In either case, we see the same process name printed; it is the name of the kernel thread that implements the shared workqueue. The CPU number is printed after the slash; we never know which CPU will be running when the /proc file is read, but the work function will always run on the same processor thereafter.

If you need to cancel a work entry submitted to the shared queue, you may use cancel_delayed_work, as described above. Flushing the shared workqueue requires a separate function, however:

void flush_scheduled_work(void);

Since you do not know who else might be using this queue, you never really know how long it might take for flush_scheduled_work to return.

Quick Reference

This chapter introduced the following symbols.

Timekeeping

#include <linux/param.h>

HZ

The HZ symbol specifies the number of clock ticks generated per second.

#include <linux/jiffies.h>

volatile unsigned long jiffies

u64 jiffies_64

The jiffies_64 variable is incremented once for each clock tick; thus, it's incremented HZ times per second. Kernel code most often refers to jiffies, which is the same as jiffies_64 on 64-bit platforms and the least significant half of it on 32-bit platforms.

int time_after(unsigned long a, unsigned long b);

int time_before(unsigned long a, unsigned long b);

int time_after_eq(unsigned long a, unsigned long b);

int time_before_eq(unsigned long a, unsigned long b);

These Boolean expressions compare jiffies in a safe way, without problems in case of counter overflow and without the need to access jiffies_64.

u64 get_jiffies_64(void);

Retrieves jiffies_64 without race conditions.

#include <linux/time.h>

unsigned long timespec_to_jiffies(struct timespec *value);

void jiffies_to_timespec(unsigned long jiffies, struct timespec *value);

unsigned long timeval_to_jiffies(struct timeval *value);

void jiffies_to_timeval(unsigned long jiffies, struct timeval *value);

Converts time representations between jiffies and other representations.

#include <asm/msr.h>

rdtsc(low32,high32);

rdtscl(low32);

rdtscll(var32);

x86-specific macros to read the timestamp counter. They read it as two 32-bit halves, read only the lower half, or read all of it into a long long variable.

#include <linux/timex.h>

cycles_t get_cycles(void);

Returns the timestamp counter in a platform-independent way. If the CPU offers no timestamp feature, 0 is returned.

#include <linux/time.h>

unsigned long mktime(year, mon, day, h, m, s);

Returns the number of seconds since the Epoch, based on the six unsigned int arguments.

void do_gettimeofday(struct timeval *tv);

Returns the current time, as seconds and microseconds since the Epoch, with the best resolution the hardware can offer. On most platforms the resolution is one microsecond or better, although some platforms offer only jiffies resolution.

struct timespec current_kernel_time(void);

Returns the current time with the resolution of one jiffy.

Delays

#include <linux/wait.h>

long wait_event_interruptible_timeout(wait_queue_head_t *q, condition, signed

long timeout);

Puts the current process to sleep on the wait queue, installing a timeout value expressed in jiffies. Use schedule_timeout (below) for noninterruptible sleeps.

#include <linux/sched.h>

signed long schedule_timeout(signed long timeout);

Calls the scheduler after ensuring that the current process is awakened at timeout expiration. The caller must invoke set_current_state first to put itself in an interruptible or noninterruptible sleep state.

#include <linux/delay.h>

void ndelay(unsigned long nsecs);

void udelay(unsigned long usecs);

void mdelay(unsigned long msecs);

Introduces delays of an integer number of nanoseconds, microseconds, and milliseconds. The delay achieved is at least the requested value, but it can be more. The argument to each function must not exceed a platform-specific limit (usually a few thousand).

void msleep(unsigned int millisecs);

unsigned long msleep_interruptible(unsigned int millisecs);

void ssleep(unsigned int seconds);

Puts the process to sleep for the given number of milliseconds (or seconds, in the case of ssleep).

Kernel Timers

#include <asm/hardirq.h>

int in_interrupt(void);

int in_atomic(void);

Returns a Boolean value telling whether the calling code is executing in interrupt context or atomic context. Interrupt context is outside of a process context, either during hardware or software interrupt processing. Atomic context is any situation where you cannot schedule: either interrupt context, or process context with a spinlock held.

#include <linux/timer.h>

void init_timer(struct timer_list * timer);

struct timer_list TIMER_INITIALIZER(_function, _expires, _data);

This function and the static declaration of the timer structure are the two ways to initialize a timer_list data structure.

void add_timer(struct timer_list * timer);

Registers the timer structure to run on the current CPU.

int mod_timer(struct timer_list *timer, unsigned long expires);

Changes the expiration time of an already scheduled timer structure. It can also act as an alternative to add_timer.

int timer_pending(struct timer_list * timer);

Macro that returns a Boolean value stating whether the timer structure is already registered to run.

void del_timer(struct timer_list * timer);

void del_timer_sync(struct timer_list * timer);

Removes a timer from the list of active timers. The latter function ensures that the timer is not currently running on another CPU.

Tasklets

#include <linux/interrupt.h>

DECLARE_TASKLET(name, func, data);

DECLARE_TASKLET_DISABLED(name, func, data);

void tasklet_init(struct tasklet_struct *t, void (*func)(unsigned long),

unsigned long data);

The first two macros declare a tasklet structure, while the tasklet_init function initializes a tasklet structure that has been obtained by allocation or other means. The second DECLARE macro marks the tasklet as disabled.

void tasklet_disable(struct tasklet_struct *t);

void tasklet_disable_nosync(struct tasklet_struct *t);

void tasklet_enable(struct tasklet_struct *t);

Disables and reenables a tasklet. Each disable must be matched with an enable (you can disable the tasklet even if it's already disabled). The function tasklet_disable waits for the tasklet to terminate if it is running on another CPU. The nosync version doesn't take this extra step.

void tasklet_schedule(struct tasklet_struct *t);

void tasklet_hi_schedule(struct tasklet_struct *t);

Schedules a tasklet to run, either as a "normal" tasklet or a high-priority one. When soft interrupts are executed, high-priority tasklets are dealt with first, while normal tasklets run last.

void tasklet_kill(struct tasklet_struct *t);

Removes the tasklet from the list of active ones, if it's scheduled to run. Like tasklet_disable, the function may block on SMP systems waiting for the tasklet to terminate if it's currently running on another CPU.

Workqueues

#include <linux/workqueue.h>

struct workqueue_struct;

struct work_struct;

The structures representing a workqueue and a work entry, respectively.

struct workqueue_struct *create_workqueue(const char *name);

struct workqueue_struct *create_singlethread_workqueue(const char *name);

void destroy_workqueue(struct workqueue_struct *queue);

Functions for creating and destroying workqueues. A call to create_workqueue creates a queue with a worker thread on each processor in the system; instead, create_singlethread_workqueue creates a workqueue with a single worker process.

DECLARE_WORK(name, void (*function)(void *), void *data);

INIT_WORK(struct work_struct *work, void (*function)(void *), void *data);

PREPARE_WORK(struct work_struct *work, void (*function)(void *), void *data);

Macros that declare and initialize workqueue entries.

int queue_work(struct workqueue_struct *queue, struct work_struct *work);

int queue_delayed_work(struct workqueue_struct *queue, struct work_struct

*work, unsigned long delay);

将工作排队以从工作队列执行的函数。

Functions that queue work for execution from a workqueue.

int cancel_delayed_work(struct work_struct *work);

void flush_workqueue(struct workqueue_struct *queue);

使用 cancel_delayed_work 从工作队列中删除一个条目;flush_workqueue 确保没有工作队列条目在系统中的任何地方运行。

Use cancel_delayed_work to remove an entry from a workqueue; flush_workqueue ensures that no workqueue entries are running anywhere in the system.

int schedule_work(struct work_struct *work);

int schedule_delayed_work(struct work_struct *work, unsigned long delay);

void flush_scheduled_work(void);

使用共享工作队列的函数。

Functions for working with the shared workqueue.




[ 1 ]需要 尾部nop指令来防止编译器访问紧随mfc0 的指令中的目标寄存器。这种互锁是 RISC 处理器的典型特征,编译器仍然可以在延迟槽中调度有用的指令。在这种情况下,我们使用nop,因为内联汇编对于编译器来说是一个黑匣子,无法进行优化。

[1] The trailing nop instruction is required to prevent the compiler from accessing the target register in the instruction immediately following mfc0. This kind of interlock is typical of RISC processors, and the compiler can still schedule useful instructions in the delay slots. In this case, we use nop because inline assembly is a black box for the compiler and no optimization can be performed.

[2] udelay 中的 u 代表希腊字母 mu,表示 micro(微)。

[2] The u in udelay represents the Greek letter mu and stands for micro.

第 8 章分配内存

Chapter 8. Allocating Memory

到目前为止,我们已经使用kmallockfree来分配和释放内存。然而,Linux 内核提供了一组更丰富的内存分配原语。在本章中,我们将了解在设备驱动程序中使用内存的其他方法以及如何优化系统的内存资源。我们不会深入了解不同架构如何实际管理内存。模块不涉及分段、分页等问题,因为内核为驱动程序提供了统一的内存管理接口。另外,本章不会描述内存管理的内部细节,而是推迟到第15章

Thus far, we have used kmalloc and kfree for the allocation and freeing of memory. The Linux kernel offers a richer set of memory allocation primitives, however. In this chapter, we look at other ways of using memory in device drivers and how to optimize your system's memory resources. We do not get into how the different architectures actually administer memory. Modules are not involved in issues of segmentation, paging, and so on, since the kernel offers a unified memory management interface to the drivers. In addition, we won't describe the internal details of memory management in this chapter, but defer it to Chapter 15.

kmalloc 的真实故事

The Real Story of kmalloc

kmalloc 分配引擎是一个强大的工具,并且由于与 malloc 相似而易于学习。该函数速度很快(除非它阻塞),并且不会清除它获得的内存;分配的区域仍保留其以前的内容。[1] 分配的区域在物理内存中也是连续的。在接下来的几节中,我们将详细讨论 kmalloc,以便您可以将其与稍后讨论的内存分配技术进行比较。

The kmalloc allocation engine is a powerful tool and easily learned because of its similarity to malloc. The function is fast (unless it blocks) and doesn't clear the memory it obtains; the allocated region still holds its previous content.[1] The allocated region is also contiguous in physical memory. In the next few sections, we talk in detail about kmalloc, so you can compare it with the memory allocation techniques that we discuss later.

flags 参数

The Flags Argument

请记住kmalloc的原型是:

Remember that the prototype for kmalloc is:

#include <linux/slab.h>

void *kmalloc(size_t size, int flags);

kmalloc的第一个参数是要分配的块的大小。第二个参数,分配标志,更有趣,因为它以多种方式控制kmalloc的行为。

The first argument to kmalloc is the size of the block to be allocated. The second argument, the allocation flags, is much more interesting, because it controls the behavior of kmalloc in a number of ways.

最常用的标志是 GFP_KERNEL,它意味着这次分配(在内部最终通过调用 __get_free_pages 执行,这也是 GFP_ 前缀的来源)是代表在内核空间中运行的进程执行的。换句话说,这意味着调用函数正在代表某个进程执行系统调用。使用 GFP_KERNEL 意味着在内存不足的情况下调用 kmalloc 时,它可以让当前进程进入睡眠状态以等待页面。因此,使用 GFP_KERNEL 分配内存的函数必须是可重入的,并且不能在原子上下文中运行。当当前进程休眠时,内核会采取适当的操作来寻找一些空闲内存,方法是将缓冲区刷新到磁盘,或者从用户进程换出内存。

The most commonly used flag, GFP_KERNEL, means that the allocation (internally performed by calling, eventually, _ _get_free_pages, which is the source of the GFP_ prefix) is performed on behalf of a process running in kernel space. In other words, this means that the calling function is executing a system call on behalf of a process. Using GFP_KERNEL means that kmalloc can put the current process to sleep waiting for a page when called in low-memory situations. A function that allocates memory using GFP_KERNEL must, therefore, be reentrant and cannot be running in atomic context. While the current process sleeps, the kernel takes proper action to locate some free memory, either by flushing buffers to disk or by swapping out memory from a user process.

GFP_KERNEL 并不总是正确的分配标志;有时 kmalloc 是从进程上下文之外调用的。例如,这种调用可能发生在中断处理程序、tasklet 和内核定时器中。在这种情况下,不应将 current 进程置于睡眠状态,驱动程序应改用 GFP_ATOMIC 标志。内核通常会保留一些空闲页面,以便满足原子分配。使用 GFP_ATOMIC 时,kmalloc 甚至可以用掉最后一个空闲页面。不过,如果连最后一页也不存在,分配就会失败。

GFP_KERNEL isn't always the right allocation flag to use; sometimes kmalloc is called from outside a process's context. This type of call can happen, for instance, in interrupt handlers, tasklets, and kernel timers. In this case, the current process should not be put to sleep, and the driver should use a flag of GFP_ATOMIC instead. The kernel normally tries to keep some free pages around in order to fulfill atomic allocation. When GFP_ATOMIC is used, kmalloc can use even the last free page. If that last page does not exist, however, the allocation fails.
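As a minimal sketch of the rule just described (the function names here are our own, not from the book's sample code), the choice of flag follows directly from the calling context:

```c
#include <linux/slab.h>

/* Process context (e.g., inside a read or ioctl method): GFP_KERNEL
 * is fine, but the call may sleep while the kernel finds memory. */
static void *alloc_from_process_context(size_t len)
{
    return kmalloc(len, GFP_KERNEL);
}

/* Interrupt handlers, tasklets, and timer functions must not sleep:
 * use GFP_ATOMIC and be prepared for the allocation to fail. */
static void *alloc_from_atomic_context(size_t len)
{
    void *buf = kmalloc(len, GFP_ATOMIC);

    if (!buf)
        return NULL;  /* no memory right now; caller must cope gracefully */
    return buf;
}
```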

除了 GFP_KERNEL 和 GFP_ATOMIC 之外,还可以用其他标志来代替或补充它们,尽管这两个标志已满足设备驱动程序的大部分需求。所有标志都在 <linux/gfp.h> 中定义,单个标志以双下划线为前缀,例如 _ _GFP_DMA。此外,还有代表常用标志组合的符号;这些符号没有前缀,有时被称为分配优先级。后者包括:

Other flags can be used in place of or in addition to GFP_KERNEL and GFP_ATOMIC, although those two cover most of the needs of device drivers. All the flags are defined in <linux/gfp.h>, and individual flags are prefixed with a double underscore, such as _ _GFP_DMA. In addition, there are symbols that represent frequently used combinations of flags; these lack the prefix and are sometimes called allocation priorities . The latter include:

GFP_ATOMIC
GFP_ATOMIC

用于从中断处理程序和进程上下文之外的其他代码分配内存。从不休眠。

Used to allocate memory from interrupt handlers and other code outside of a process context. Never sleeps.

GFP_KERNEL
GFP_KERNEL

内核内存的正常分配。可能休眠。

Normal allocation of kernel memory. May sleep.

GFP_USER
GFP_USER

用于为用户空间页面分配内存;可能会休眠。

Used to allocate memory for user-space pages; it may sleep.

GFP_HIGHUSER
GFP_HIGHUSER

与 GFP_USER 类似,但在有高端内存时从高端内存分配。高端内存将在下一小节中描述。

Like GFP_USER, but allocates from high memory, if any. High memory is described in the next subsection.

GFP_NOIO

GFP_NOFS
GFP_NOIO

GFP_NOFS

这些标志的功能类似于GFP_KERNEL,但它们对内核可以执行哪些操作来满足请求添加了限制。GFP_NOFS不允许分配执行任何文件系统调用,而GFP_NOIO 根本不允许启动任何 I/O。它们主要用于文件系统和虚拟内存代码,其中可以允许分配休眠,但递归文件系统调用将是一个坏主意。

These flags function like GFP_KERNEL, but they add restrictions on what the kernel can do to satisfy the request. A GFP_NOFS allocation is not allowed to perform any filesystem calls, while GFP_NOIO disallows the initiation of any I/O at all. They are used primarily in the filesystem and virtual memory code where an allocation may be allowed to sleep, but recursive filesystem calls would be a bad idea.

上面列出的分配标志可以与以下任意标志进行按位或来扩充,这些标志会改变分配的执行方式:

The allocation flags listed above can be augmented by an ORing in any of the following flags, which change how the allocation is carried out:

_ _GFP_DMA
_ _GFP_DMA

该标志请求在支持 DMA 的内存区域中进行分配。确切的含义取决于平台,并在以下部分中进行解释。

This flag requests allocation to happen in the DMA-capable memory zone. The exact meaning is platform-dependent and is explained in the following section.

_ _GFP_HIGHMEM
_ _GFP_HIGHMEM

该标志表明分配的内存可能位于高端内存。

This flag indicates that the allocated memory may be located in high memory.

_ _GFP_COLD
_ _GFP_COLD

正常情况下,内存 分配器尝试返回“缓存热”页面,即可能在处理器缓存中找到的页面。相反,该标志请求一个“冷”页面,该页面已经有一段时间没有被使用了。它对于分配用于 DMA 读取的页面非常有用,而处理器缓存中的页面则没有用处。有关如何分配 DMA 缓冲区的完整讨论,请参阅第 15 章。

Normally, the memory allocator tries to return "cache warm" pages—pages that are likely to be found in the processor cache. Instead, this flag requests a "cold" page, which has not been used in some time. It is useful for allocating pages for DMA reads, where presence in the processor cache is not useful. See Chapter 15 for a full discussion of how to allocate DMA buffers.

_ _GFP_NOWARN
_ _GFP_NOWARN

这个很少使用的标志可以防止内核在无法满足分配时发出警告(使用 printk )。

This rarely used flag prevents the kernel from issuing warnings (with printk) when an allocation cannot be satisfied.

_ _GFP_HIGH
_ _GFP_HIGH

该标志标记一个高优先级请求,甚至允许消耗内核为紧急情况预留的最后一页内存。

This flag marks a high-priority request, which is allowed to consume even the last pages of memory set aside by the kernel for emergencies.

_ _GFP_REPEAT

_ _GFP_NOFAIL

_ _GFP_NORETRY
_ _GFP_REPEAT

_ _GFP_NOFAIL

_ _GFP_NORETRY

这些标志修改了分配器在难以满足分配时的行为方式。__GFP_REPEAT 的意思是通过重复尝试来“再努力一点”,但分配仍然可能失败。__GFP_NOFAIL 标志告诉分配器永远不要失败;它会尽一切努力来满足请求。强烈建议不要使用 __GFP_NOFAIL;在设备驱动程序中可能永远不会有使用它的正当理由。最后,__GFP_NORETRY 告诉分配器,如果请求的内存不可用,就立即放弃。

These flags modify how the allocator behaves when it has difficulty satisfying an allocation. _ _GFP_REPEAT means "try a little harder" by repeating the attempt—but the allocation can still fail. The _ _GFP_NOFAIL flag tells the allocator never to fail; it works as hard as needed to satisfy the request. Use of _ _GFP_NOFAIL is very strongly discouraged; there will probably never be a valid reason to use it in a device driver. Finally, _ _GFP_NORETRY tells the allocator to give up immediately if the requested memory is not available.
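The ORing described above can be sketched as follows (a hypothetical helper, not from the book's examples): the allocation priority supplies the base behavior, and the double-underscore modifiers adjust the zone and failure policy:

```c
#include <linux/slab.h>

/* Hypothetical: a buffer for a legacy ISA DMA device, requested from
 * atomic context.  We want DMA-zone memory, no warning printed on
 * failure, and an immediate give-up rather than a retry. */
static void *alloc_isa_dma_buffer(size_t len)
{
    return kmalloc(len, GFP_ATOMIC | __GFP_DMA |
                        __GFP_NOWARN | __GFP_NORETRY);
}
```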

内存区域

Memory zones

__GFP_DMA 和 __GFP_HIGHMEM 都具有依赖于平台的作用,尽管它们在所有平台上都可以使用。

Both _ _GFP_DMA and _ _GFP_HIGHMEM have a platform-dependent role, although their use is valid for all platforms.

Linux 内核至少知道三个内存区域:支持 DMA 的内存、普通内存和高端内存。虽然分配通常发生在普通区域,但设置刚才提到的任一位都要求从不同的区域分配内存。其思路是,每个必须了解特殊内存范围(而不是把所有 RAM 视为等同)的计算机平台都可以归入这种抽象。

The Linux kernel knows about a minimum of three memory zones: DMA-capable memory, normal memory, and high memory. While allocation normally happens in the normal zone, setting either of the bits just mentioned requires memory to be allocated from a different zone. The idea is that every computer platform that must know about special memory ranges (instead of considering all RAM equivalents) will fall into this abstraction.

支持 DMA 的内存是位于优先地址范围内的内存,外设可以在其中执行 DMA 访问。在大多数正常的平台上,所有内存都位于该区域中。在 x86 上,DMA 区域用于前 16 MB RAM,传统 ISA 设备可以在其中执行 DMA;PCI 设备没有这样的限制。

DMA-capable memory is memory that lives in a preferential address range, where peripherals can perform DMA access. On most sane platforms, all memory lives in this zone. On the x86, the DMA zone is used for the first 16 MB of RAM, where legacy ISA devices can perform DMA; PCI devices have no such limit.

高端内存是一种用于在 32 位平台上访问(相对)大量内存的机制。如果不先建立特殊映射,就无法从内核直接访问这种内存,而且它通常更难使用。然而,如果您的驱动程序使用大量内存,那么当它能够使用高端内存时,它将在大型系统上工作得更好。有关高端内存如何工作以及如何使用它的详细说明,请参阅第 15 章中的 1.8 节。

High memory is a mechanism used to allow access to (relatively) large amounts of memory on 32-bit platforms. This memory cannot be directly accessed from the kernel without first setting up a special mapping and is generally harder to work with. If your driver uses large amounts of memory, however, it will work better on large systems if it can use high memory. See the Section 1.8 in Chapter 15 for a detailed description of how high memory works and how to use it.

每当分配新页面来满足内存分配请求时,内核都会构建一个可在搜索中使用的区域列表。如果指定了 __GFP_DMA,则只搜索 DMA 区域:如果低地址处没有可用内存,分配就会失败。如果没有指定特殊标志,则搜索普通内存和 DMA 内存;如果设置了 __GFP_HIGHMEM,则使用所有三个区域来搜索空闲页面。(但请注意,kmalloc 无法分配高端内存。)

Whenever a new page is allocated to fulfill a memory allocation request, the kernel builds a list of zones that can be used in the search. If _ _GFP_DMA is specified, only the DMA zone is searched: if no memory is available at low addresses, allocation fails. If no special flag is present, both normal and DMA memory are searched; if _ _GFP_HIGHMEM is set, all three zones are used to search a free page. (Note, however, that kmalloc cannot allocate high memory.)

在非均匀内存访问(NUMA)系统上,情况更为复杂。作为一般规则,分配器会尝试定位执行分配的处理器本地的内存,尽管有多种方法可以改变这种行为。

The situation is more complicated on nonuniform memory access (NUMA) systems. As a general rule, the allocator attempts to locate memory local to the processor performing the allocation, although there are ways of changing that behavior.

内存区域背后的机制在 mm/page_alloc.c 中实现,而区域的初始化位于特定于平台的文件中,通常在体系结构树内的 mm/init.c 中。我们将在第 15 章重新讨论这些主题。

The mechanism behind memory zones is implemented in mm/page_alloc.c, while initialization of the zone resides in platform-specific files, usually in mm/init.c within the arch tree. We'll revisit these topics in Chapter 15.

size 参数

The Size Argument

内核管理 系统的物理 内存,仅在页面大小的块中可用。因此, kmalloc看起来与典型的用户空间 malloc实现有很大不同。简单的、面向堆的分配技术很快就会遇到麻烦;围绕页面边界工作会很困难。因此,内核使用特殊的面向页面的分配技术来充分利用系统的 RAM。

The kernel manages the system's physical memory, which is available only in page-sized chunks. As a result, kmalloc looks rather different from a typical user-space malloc implementation. A simple, heap-oriented allocation technique would quickly run into trouble; it would have a hard time working around the page boundaries. Thus, the kernel uses a special page-oriented allocation technique to get the best use from the system's RAM.

Linux 通过创建一组固定大小的内存对象池来处理内存分配。分配请求的处理方法是进入一个容纳足够大对象的池,并将整个内存块返回给请求者。内存管理方案相当复杂,设备驱动程序编写者通常对它的细节并不那么感兴趣。

Linux handles memory allocation by creating a set of pools of memory objects of fixed sizes. Allocation requests are handled by going to a pool that holds sufficiently large objects and handing an entire memory chunk back to the requester. The memory management scheme is quite complex, and the details of it are not normally all that interesting to device driver writers.

不过,驱动程序开发人员应该记住的一件事是,内核只能分配某些预定义的、固定大小的字节数组。如果您请求任意数量的内存,您获得的内存可能会比您所要求的稍多一些,最多可达两倍。另外,程序员应该记住, kmalloc可以处理的最小分配 大小为 32 或 64 字节,具体取决于系统体系结构使用的页面大小。

The one thing driver developers should keep in mind, though, is that the kernel can allocate only certain predefined, fixed-size byte arrays. If you ask for an arbitrary amount of memory, you're likely to get slightly more than you asked for, up to twice as much. Also, programmers should remember that the smallest allocation that kmalloc can handle is as big as 32 or 64 bytes, depending on the page size used by the system's architecture.

有一个上限 kmalloc可以分配的内存块的大小 。该限制因体系结构和内核配置选项而异。如果您的代码要完全可移植,则不能指望能够分配大于 128 KB 的任何内容。但是,如果您需要超过几千字节,则有比kmalloc更好的方法来获取内存,我们将在本章后面介绍。

There is an upper limit to the size of memory chunks that can be allocated by kmalloc. That limit varies depending on architecture and kernel configuration options. If your code is to be completely portable, it cannot count on being able to allocate anything larger than 128 KB. If you need more than a few kilobytes, however, there are better ways than kmalloc to obtain memory, which we describe later in this chapter.

后备缓存

Lookaside Caches

设备驱动程序常常会反复分配许多大小相同的对象。既然内核已经维护了一组由大小相同的对象构成的内存池,为什么不为这些大量使用的对象添加一些特殊的池呢?事实上,内核确实实现了创建这类池的工具,通常称为后备缓存(lookaside cache)。设备驱动程序通常不会表现出足以证明应使用后备缓存的内存行为,但也可能有例外;Linux 2.6 中的 USB 和 SCSI 驱动程序就使用了缓存。

A device driver often ends up allocating many objects of the same size, over and over. Given that the kernel already maintains a set of memory pools of objects that are all the same size, why not add some special pools for these high-volume objects? In fact, the kernel does implement a facility to create this sort of pool, which is often called a lookaside cache. Device drivers normally do not exhibit the sort of memory behavior that justifies using a lookaside cache, but there can be exceptions; the USB and SCSI drivers in Linux 2.6 use caches.

Linux 内核中的缓存管理器有时称为“slab 分配器”。因此,它的函数和类型在 <linux/slab.h> 中声明。slab 分配器实现的缓存类型为 kmem_cache_t;它们通过调用 kmem_cache_create 创建:

The cache manager in the Linux kernel is sometimes called the "slab allocator." For that reason, its functions and types are declared in <linux/slab.h>. The slab allocator implements caches that have a type of kmem_cache_t; they are created with a call to kmem_cache_create:

kmem_cache_t *kmem_cache_create(const char *name, size_t size,
                                size_t offset, 
                                unsigned long flags,
                                void (*constructor)(void *, kmem_cache_t *,
                                                    unsigned long flags),
                                void (*destructor)(void *, kmem_cache_t *,
                                                   unsigned long flags));

该函数创建一个新的缓存对象,它可以容纳任意数量的内存区域,这些区域大小相同,由 size 参数指定。name 参数与此缓存相关联,用作可用于跟踪问题的内务信息;通常,它被设置为被缓存的结构类型的名称。缓存保留指向该名称的指针,而不是复制它,因此驱动程序应该传入一个指向静态存储中名称的指针(通常该名称只是一个字面字符串)。名称中不能包含空格。

The function creates a new cache object that can host any number of memory areas all of the same size, specified by the size argument. The name argument is associated with this cache and functions as housekeeping information usable in tracking problems; usually, it is set to the name of the type of structure that is cached. The cache keeps a pointer to the name, rather than copying it, so the driver should pass in a pointer to a name in static storage (usually the name is just a literal string). The name cannot contain blanks.

offset 是页面中第一个对象的偏移量;它可用于确保已分配对象的特定对齐,但您很可能使用 0 来请求默认值。flags 控制如何完成分配,它是以下标志的位掩码:

The offset is the offset of the first object in the page; it can be used to ensure a particular alignment for the allocated objects, but you most likely will use 0 to request the default value. flags controls how allocation is done and is a bit mask of the following flags:

SLAB_NO_REAP
SLAB_NO_REAP

设置此标志可以保护缓存在系统寻找内存时不被减少。设置此标志通常是一个坏主意;避免不必要地限制内存分配器的操作自由非常重要。

Setting this flag protects the cache from being reduced when the system is looking for memory. Setting this flag is normally a bad idea; it is important to avoid restricting the memory allocator's freedom of action unnecessarily.

SLAB_HWCACHE_ALIGN
SLAB_HWCACHE_ALIGN

该标志要求每个数据对象与缓存行对齐;实际对齐取决于主机平台的缓存布局。如果您的缓存包含在 SMP 计算机上经常访问的项目,此选项可能是一个不错的选择。然而,为实现缓存行对齐所需的填充最终可能会浪费大量内存。

This flag requires each data object to be aligned to a cache line; actual alignment depends on the cache layout of the host platform. This option can be a good choice if your cache contains items that are frequently accessed on SMP machines. The padding required to achieve cache line alignment can end up wasting significant amounts of memory, however.

SLAB_CACHE_DMA
SLAB_CACHE_DMA

该标志要求每个数据对象都分配在 DMA 内存区域中。

This flag requires each data object to be allocated in the DMA memory zone.

还有一组标志可以在调试缓存分配期间使用;详细信息请参见mm/slab.c 。然而,通常这些标志是通过用于开发的系统上的内核配置选项全局设置的。

There is also a set of flags that can be used during the debugging of cache allocations; see mm/slab.c for the details. Usually, however, these flags are set globally via a kernel configuration option on systems used for development.

constructor 和 destructor 参数是可选函数(但没有构造函数就不能有析构函数);前者可用于初始化新分配的对象,后者可用于在对象的内存被整体释放回系统之前“清理”对象。

The constructor and destructor arguments to the function are optional functions (but there can be no destructor without a constructor); the former can be used to initialize newly allocated objects, and the latter can be used to "clean up" objects prior to their memory being released back to the system as a whole.

构造函数和析构函数可能很有用,但您应该记住一些限制。当为一组对象分配内存时,会调用构造函数;因为该内存可能容纳多个对象,所以构造函数可能被多次调用。您不能假设分配对象后会立即调用构造函数。同样,析构函数可能在某个未知的未来时间被调用,而不是在对象被释放后立即调用。构造函数和析构函数可能被允许休眠,也可能不被允许,这取决于它们是否被传递了 SLAB_CTOR_ATOMIC 标志(其中 CTOR 是 constructor 的缩写)。

Constructors and destructors can be useful, but there are a few constraints that you should keep in mind. A constructor is called when the memory for a set of objects is allocated; because that memory may hold several objects, the constructor may be called multiple times. You cannot assume that the constructor will be called as an immediate effect of allocating an object. Similarly, destructors can be called at some unknown future time, not immediately after an object has been freed. Constructors and destructors may or may not be allowed to sleep, according to whether they are passed the SLAB_CTOR_ATOMIC flag (where CTOR is short for constructor).

为了方便起见,程序员可以对构造函数和析构函数使用相同的函数;SLAB_CTOR_CONSTRUCTOR当被调用者是构造函数时,slab 分配器总是传递该标志。

For convenience, a programmer can use the same function for both the constructor and destructor; the slab allocator always passes the SLAB_CTOR_CONSTRUCTOR flag when the callee is a constructor.

创建对象缓存后,您可以通过调用 kmem_cache_alloc从中分配对象:

Once a cache of objects is created, you can allocate objects from it by calling kmem_cache_alloc:

void *kmem_cache_alloc(kmem_cache_t *cache, int flags);

在这里,cache 参数是您之前创建的缓存;flags 与传递给 kmalloc 的标志相同,当 kmem_cache_alloc 需要自行分配更多内存时会用到它们。

Here, the cache argument is the cache you have created previously; the flags are the same as you would pass to kmalloc and are consulted if kmem_cache_alloc needs to go out and allocate more memory itself.

要释放对象,请使用kmem_cache_free

To free an object, use kmem_cache_free:

void kmem_cache_free(kmem_cache_t *cache, const void *obj);

当驱动程序代码完成缓存时,通常是在卸载模块时,它应该按如下方式释放其缓存:

When driver code is finished with the cache, typically when the module is unloaded, it should free its cache as follows:

int kmem_cache_destroy(kmem_cache_t *cache);

仅当从缓存分配的所有对象都已返回给缓存时,销毁操作才会成功。因此,模块应该检查 kmem_cache_destroy的返回状态;失败表明模块内存在某种内存泄漏(因为某些对象已被删除)。

The destroy operation succeeds only if all objects allocated from the cache have been returned to it. Therefore, a module should check the return status from kmem_cache_destroy; a failure indicates some sort of memory leak within the module (since some of the objects have been dropped).
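A cleanup path that honors this advice might look like the following sketch (my_cache is a hypothetical cache pointer, assumed to have been created earlier with kmem_cache_create):

```c
#include <linux/slab.h>

static kmem_cache_t *my_cache;  /* created earlier, in module init */

static void my_cleanup(void)
{
    /* A nonzero return means objects are still outstanding: a leak. */
    if (my_cache && kmem_cache_destroy(my_cache))
        printk(KERN_ALERT "mydriver: cache not freed, memory leak?\n");
}
```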

使用后备缓存的一个好处是内核可以维护缓存使用情况的统计信息。这些统计信息可以从/proc/slabinfo获得。

One side benefit to using lookaside caches is that the kernel maintains statistics on cache usage. These statistics may be obtained from /proc/slabinfo.

基于Slab Cache的scull:scullc

A scull Based on the Slab Caches: scullc

是时候举个例子了。scullc是scull模块的精简版本 ,它仅实现裸设备——持久内存区域。与使用kmalloc 的scull不同, scullc使用内存缓存。量子的大小可以在编译时和加载时修改,但不能在运行时修改——这需要创建新的内存缓存,而我们不想处理这些不需要的细节。

Time for an example. scullc is a cut-down version of the scull module that implements only the bare device—the persistent memory region. Unlike scull, which uses kmalloc, scullc uses memory caches. The size of the quantum can be modified at compile time and at load time, but not at runtime—that would require creating a new memory cache, and we didn't want to deal with these unneeded details.

scullc是一个完整的示例,可用于尝试平板分配器。它与scull 的区别仅在于几行代码。首先,我们必须声明我们自己的slab缓存:

scullc is a complete example that can be used to try out the slab allocator. It differs from scull only in a few lines of code. First, we must declare our own slab cache:

/* declare one cache pointer: use it for all devices */
kmem_cache_t *scullc_cache;

slab 缓存的创建是这样处理的(在模块加载时):

The creation of the slab cache is handled (at module load time) in this way:

/* scullc_init: create a cache for our quanta */
scullc_cache = kmem_cache_create("scullc", scullc_quantum,
        0, SLAB_HWCACHE_ALIGN, NULL, NULL); /* no ctor/dtor */
if (!scullc_cache) {
    scullc_cleanup(  );
    return -ENOMEM;
}

这是它分配内存量子的方式:

This is how it allocates memory quanta:

/* Allocate a quantum using the memory cache */
if (!dptr->data[s_pos]) {
    dptr->data[s_pos] = kmem_cache_alloc(scullc_cache, GFP_KERNEL);
    if (!dptr->data[s_pos])
        goto nomem;
    memset(dptr->data[s_pos], 0, scullc_quantum);
}

这些行释放内存:

And these lines release memory:

for (i = 0; i < qset; i++)
if (dptr->data[i])
        kmem_cache_free(scullc_cache, dptr->data[i]);

最后,在模块卸载时,我们必须将缓存返回给系统:

Finally, at module unload time, we have to return the cache to the system:

/* scullc_cleanup: release the cache of our quanta */
if (scullc_cache)
    kmem_cache_destroy(scullc_cache);

从scull传递到 scullc的主要区别是速度略有提高和更好的内存使用。由于量子是从大小正确的内存碎片池中分配的,因此它们在内存中的放置尽可能密集,这与scull量子相反,后者会 带来不可预测的内存碎片。

The main differences in passing from scull to scullc are a slight speed improvement and better memory use. Since quanta are allocated from a pool of memory fragments of exactly the right size, their placement in memory is as dense as possible, as opposed to scull quanta, which bring in an unpredictable memory fragmentation.

内存池

Memory Pools

内核中有些地方不允许内存分配失败。作为在这些情况下保证分配成功的一种方法,内核开发人员创建了一种称为内存池(或“mempool”)的抽象。内存池实际上只是后备缓存的一种形式,它设法始终保留一个空闲内存列表,以备紧急情况使用。

There are places in the kernel where memory allocations cannot be allowed to fail. As a way of guaranteeing allocations in those situations, the kernel developers created an abstraction known as a memory pool (or "mempool"). A memory pool is really just a form of a lookaside cache that tries to always keep a list of free memory around for use in emergencies.

内存池的类型为mempool_t(在 <linux/mempool.h>中定义);您可以使用mempool_create创建一个 :

A memory pool has a type of mempool_t (defined in <linux/mempool.h>); you can create one with mempool_create:

mempool_t *mempool_create(int min_nr, 
                          mempool_alloc_t *alloc_fn,
                          mempool_free_t *free_fn, 
                          void *pool_data);

min_nr 参数是池应始终保留的已分配对象的最小数量。对象的实际分配和释放由 alloc_fn 和 free_fn 处理,它们具有以下原型:

The min_nr argument is the minimum number of allocated objects that the pool should always keep around. The actual allocation and freeing of objects is handled by alloc_fn and free_fn, which have these prototypes:

typedef void *(mempool_alloc_t)(int gfp_mask, void *pool_data);
typedef void (mempool_free_t)(void *element, void *pool_data);

mempool_create ( )的最后一个参数pool_data被传递给alloc_fnfree_fn

The final parameter to mempool_create (pool_data) is passed to alloc_fn and free_fn.

如果需要,您可以编写专用函数来处理内存池的内存分配。然而,通常您只想让内核的 slab 分配器为您处理该任务。有两个函数(mempool_alloc_slab 和 mempool_free_slab)在内存池分配原型与 kmem_cache_alloc 和 kmem_cache_free 之间执行阻抗匹配。因此,设置内存池的代码通常如下所示:

If need be, you can write special-purpose functions to handle memory allocations for mempools. Usually, however, you just want to let the kernel slab allocator handle that task for you. There are two functions (mempool_alloc_slab and mempool_free_slab) that perform the impedance matching between the memory pool allocation prototypes and kmem_cache_alloc and kmem_cache_free. Thus, code that sets up memory pools often looks like the following:

cache = kmem_cache_create(. . .);
pool = mempool_create(MY_POOL_MINIMUM,
                      mempool_alloc_slab, mempool_free_slab,
                      cache);

创建池后,可以通过以下方式分配和释放对象:

Once the pool has been created, objects can be allocated and freed with:

void *mempool_alloc(mempool_t *pool, int gfp_mask);
void mempool_free(void *element, mempool_t *pool);

创建内存池时,分配函数将被调用足够多次来创建预分配对象池。此后,调用 mempool_alloc尝试从分配函数获取其他对象;如果分配失败,则返回预分配的对象之一(如果还有剩余)。当使用mempool_free释放对象时,如果当前预分配对象的数量低于最小值,则该对象将保留在池中;否则,将被返回到系统中。

When the mempool is created, the allocation function will be called enough times to create a pool of preallocated objects. Thereafter, calls to mempool_alloc attempt to acquire additional objects from the allocation function; should that allocation fail, one of the preallocated objects (if any remain) is returned. When an object is freed with mempool_free, it is kept in the pool if the number of preallocated objects is currently below the minimum; otherwise, it is returned to the system.
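Putting the calls together, a driver might use the pool like this sketch (pool is assumed to be a mempool_t created as shown earlier; the function names are our own):

```c
#include <linux/mempool.h>

/* Hypothetical: grab an object in atomic context.  If the underlying
 * allocator cannot satisfy the request, mempool_alloc falls back to
 * one of the preallocated objects, so NULL is much less likely. */
static void *my_get_object(mempool_t *pool)
{
    return mempool_alloc(pool, GFP_ATOMIC);
}

static void my_put_object(mempool_t *pool, void *obj)
{
    mempool_free(obj, pool);  /* refill the pool or return to the system */
}
```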

内存池可以通过以下方式调整大小:

A mempool can be resized with:

int mempool_resize(mempool_t *pool, int new_min_nr, int gfp_mask);

如果成功,此调用将调整池的大小以至少包含new_min_nr对象。

This call, if successful, resizes the pool to have at least new_min_nr objects.

如果您不再需要内存池,请使用以下命令将其返回给系统:

If you no longer need a memory pool, return it to the system with:

void mempool_destroy(mempool_t *pool);

您必须在销毁内存池之前返回所有已分配的对象,否则会导致内核错误。

You must return all allocated objects before destroying the mempool, or a kernel oops results.

如果您正在考虑在驱动程序中使用内存池,请记住一件事:内存池分配位于列表中的一块内存,该内存块处于空闲状态并且无法用于任何实际使用。内存池很容易消耗大量内存。几乎在所有情况下,首选的替代方案是不使用内存池,而是简单地处理分配失败的可能性。如果您的驱动程序有任何方法可以以不危及系统完整性的方式响应分配失败,请按照这种方式进行操作。在驱动程序代码中使用内存池应该很少。

If you are considering using a mempool in your driver, please keep one thing in mind: mempools allocate a chunk of memory that sits in a list, idle and unavailable for any real use. It is easy to consume a great deal of memory with mempools. In almost every case, the preferred alternative is to do without the mempool and simply deal with the possibility of allocation failures instead. If there is any way for your driver to respond to an allocation failure in a way that does not endanger the integrity of the system, do things that way. Use of mempools in driver code should be rare.

get_free_page 及相关函数

get_free_page and Friends

如果一个模块需要分配大块内存,通常最好使用面向页面的技术。请求整页还有其他优点,第15 章将介绍这些优点。

If a module needs to allocate big chunks of memory, it is usually better to use a page-oriented technique. Requesting whole pages also has other advantages, which are introduced in Chapter 15.

要分配页面,可以使用以下函数:

To allocate pages, the following functions are available:

get_zeroed_page(unsigned int flags);

返回指向新页面的指针并用零填充该页面。

Returns a pointer to a new page and fills the page with zeros.

_ _get_free_page(unsigned int flags);

get_zeroed_pa​​ge类似,但不清除页面。

Similar to get_zeroed_page, but doesn't clear the page.

_ _get_free_pages(unsigned int flags, unsigned int order);

分配并返回指向内存区域第一个字节的指针,该内存区域可能有几个(物理上连续的)页长,但不会将该区域归零。

Allocates and returns a pointer to the first byte of a memory area that is potentially several (physically contiguous) pages long but doesn't zero the area.

flags 参数的作用与 kmalloc 中相同;通常使用 GFP_KERNEL 或 GFP_ATOMIC,或许再加上 __GFP_DMA 标志(用于可供 ISA 直接内存访问操作使用的内存),或者在可以使用高端内存时加上 __GFP_HIGHMEM。[2] order 是您请求或释放的页面数的以 2 为底的对数(即 log2 N)。例如,如果您想要一页,order 为 0;如果请求八页,则为 3。如果 order 太大(没有该大小的连续区域可用),页面分配就会失败。get_order 函数接受一个整数参数,可用于从大小(必须是 2 的幂)得到宿主平台上对应的 order。order 的最大允许值为 10 或 11(对应 1024 或 2048 页),具体取决于体系结构。然而,除了刚启动且内存很多的系统之外,order 为 10 的分配成功的机会很小。

The flags argument works in the same way as with kmalloc; usually either GFP_KERNEL or GFP_ATOMIC is used, perhaps with the addition of the _ _GFP_DMA flag (for memory that can be used for ISA direct-memory-access operations) or _ _GFP_HIGHMEM when high memory can be used.[2] order is the base-two logarithm of the number of pages you are requesting or freeing (i.e., log2 N). For example, order is 0 if you want one page and 3 if you request eight pages. If order is too big (no contiguous area of that size is available), the page allocation fails. The get_order function, which takes an integer argument, can be used to extract the order from a size (that must be a power of two) for the hosting platform. The maximum allowed value for order is 10 or 11 (corresponding to 1024 or 2048 pages), depending on the architecture. The chances of an order-10 allocation succeeding on anything other than a freshly booted system with a lot of memory are small, however.

If you are curious, /proc/buddyinfo tells you how many blocks of each order are available for each memory zone on the system.

When a program is done with the pages, it can free them with one of the following functions. The first function is a macro that falls back on the second:

void free_page(unsigned long addr);
void free_pages(unsigned long addr, unsigned long order);

If you try to free a different number of pages from what you allocated, the memory map becomes corrupted, and the system gets in trouble at a later time.

It's worth stressing that _ _get_free_pages and the other functions can be called at any time, subject to the same rules we saw for kmalloc. The functions can fail to allocate memory in certain circumstances, particularly when GFP_ATOMIC is used. Therefore, the program calling these allocation functions must be prepared to handle an allocation failure.

Although kmalloc(GFP_KERNEL) sometimes fails when there is no available memory, the kernel does its best to fulfill allocation requests. Therefore, it's easy to degrade system responsiveness by allocating too much memory. For example, you can bring the computer down by pushing too much data into a scull device; the system starts crawling while it tries to swap out as much as possible in order to fulfill the kmalloc request. Since every resource is being sucked up by the growing device, the computer is soon rendered unusable; at that point, you can no longer even start a new process to try to deal with the problem. We don't address this issue in scull, since it is just a sample module and not a real tool to put into a multiuser system. As a programmer, you must be careful nonetheless, because a module is privileged code and can open new security holes in the system (the most likely is a denial-of-service hole like the one just outlined).

A scull Using Whole Pages: scullp

In order to test page allocation for real, we have released the scullp module together with other sample code. It is a reduced scull, just like scullc introduced earlier.

Memory quanta allocated by scullp are whole pages or page sets: the scullp_order variable defaults to 0 but can be changed at either compile or load time.

The following lines show how it allocates memory:

/* Here's the allocation of a single quantum */
if (!dptr->data[s_pos]) {
    dptr->data[s_pos] =
        (void *)_ _get_free_pages(GFP_KERNEL, dptr->order);
    if (!dptr->data[s_pos])
        goto nomem;
    memset(dptr->data[s_pos], 0, PAGE_SIZE << dptr->order);
}

The code to deallocate memory in scullp looks like this:

/* This code frees a whole quantum-set */
for (i = 0; i < qset; i++)
    if (dptr->data[i])
        free_pages((unsigned long)(dptr->data[i]),
                dptr->order);

At the user level, the perceived difference is primarily a speed improvement and better memory use, because there is no internal fragmentation of memory. We ran some tests copying 4 MB from scull0 to scull1 and then from scullp0 to scullp1; the results showed a slight improvement in kernel-space processor usage.

The performance improvement is not dramatic, because kmalloc is designed to be fast. The main advantage of page-level allocation isn't actually speed, but rather more efficient memory usage. Allocating by pages wastes no memory, whereas using kmalloc wastes an unpredictable amount of memory because of allocation granularity.

But the biggest advantage of the _ _get_free_page functions is that the pages obtained are completely yours, and you could, in theory, assemble the pages into a linear area by appropriate tweaking of the page tables. For example, you can allow a user process to mmap memory areas obtained as single unrelated pages. We discuss this kind of operation in Chapter 15, where we show how scullp offers memory mapping, something that scull cannot offer.

The alloc_pages Interface

For completeness, we introduce another interface for memory allocation, even though we will not be prepared to use it until after Chapter 15. For now, suffice it to say that struct page is an internal kernel structure that describes a page of memory. As we will see, there are many places in the kernel where it is necessary to work with page structures; they are especially useful in any situation where you might be dealing with high memory, which does not have a constant address in kernel space.

The real core of the Linux page allocator is a function called alloc_pages_node:

struct page *alloc_pages_node(int nid, unsigned int flags, 
                              unsigned int order);

This function also has two variants (which are simply macros); these are the versions that you will most likely use:

struct page *alloc_pages(unsigned int flags, unsigned int order);
struct page *alloc_page(unsigned int flags);

The core function, alloc_pages_node, takes three arguments. nid is the NUMA node ID[3] whose memory should be allocated, flags is the usual GFP_ allocation flags, and order is the size of the allocation. The return value is a pointer to the first of (possibly many) page structures describing the allocated memory, or, as usual, NULL on failure.

alloc_pages simplifies the situation by allocating the memory on the current NUMA node (it calls alloc_pages_node with the return value from numa_node_id as the nid parameter). And, of course, alloc_page omits the order parameter and allocates a single page.

To release pages allocated in this manner, you should use one of the following:

void _ _free_page(struct page *page);
void _ _free_pages(struct page *page, unsigned int order);
void free_hot_page(struct page *page);
void free_cold_page(struct page *page);

If you have specific knowledge of whether a single page's contents are likely to be resident in the processor cache, you should communicate that to the kernel with free_hot_page (for cache-resident pages) or free_cold_page. This information helps the memory allocator optimize its use of memory across the system.

vmalloc and Friends

The next memory allocation function that we show you is vmalloc, which allocates a contiguous memory region in the virtual address space. Although the pages are not consecutive in physical memory (each page is retrieved with a separate call to alloc_page), the kernel sees them as a contiguous range of addresses. vmalloc returns 0 (the NULL address) if an error occurs; otherwise, it returns a pointer to a linear memory area of size at least size.

We describe vmalloc here because it is one of the fundamental Linux memory allocation mechanisms. We should note, however, that use of vmalloc is discouraged in most situations. Memory obtained from vmalloc is slightly less efficient to work with, and, on some architectures, the amount of address space set aside for vmalloc is relatively small. Code that uses vmalloc is likely to get a chilly reception if submitted for inclusion in the kernel. If possible, you should work directly with individual pages rather than trying to smooth things over with vmalloc.

That said, let's see how vmalloc works. The prototypes of the function and its relatives (ioremap, which is not strictly an allocation function, is discussed later in this section) are as follows:

#include <linux/vmalloc.h>

void *vmalloc(unsigned long size);
void vfree(void * addr);
void *ioremap(unsigned long offset, unsigned long size);
void iounmap(void * addr);

It's worth stressing that memory addresses returned by kmalloc and _ _get_free_pages are also virtual addresses. Their actual value is still massaged by the MMU (the memory management unit, usually part of the CPU) before it is used to address physical memory.[4] vmalloc is not different in how it uses the hardware, but rather in how the kernel performs the allocation task.

kmalloc_ get_free_pages使用的(虚拟)地址范围具有到物理内存的一对一映射,可能会移位一个常PAGE_OFFSET量值;这些函数不需要修改该地址范围的页表。另一方面,vmallocioremap使用的地址范围 是完全合成的,每次分配都会通过适当设置页表来构建(虚拟)内存区域。

The (virtual) address range used by kmalloc and _ _get_free_pages features a one-to-one mapping to physical memory, possibly shifted by a constant PAGE_OFFSET value; the functions don't need to modify the page tables for that address range. The address range used by vmalloc and ioremap, on the other hand, is completely synthetic, and each allocation builds the (virtual) memory area by suitably setting up the page tables.

This difference can be perceived by comparing the pointers returned by the allocation functions. On some platforms (for example, the x86), addresses returned by vmalloc are just beyond the addresses that kmalloc uses. On other platforms (for example, MIPS, IA-64, and x86_64), they belong to a completely different address range. Addresses available for vmalloc are in the range from VMALLOC_START to VMALLOC_END. Both symbols are defined in <asm/pgtable.h>.

Addresses allocated by vmalloc can't be used outside of the microprocessor, because they make sense only on top of the processor's MMU. When a driver needs a real physical address (such as a DMA address, used by peripheral hardware to drive the system's bus), you can't easily use vmalloc. The right time to call vmalloc is when you are allocating memory for a large sequential buffer that exists only in software. It's important to note that vmalloc has more overhead than _ _get_free_pages, because it must both retrieve the memory and build the page tables. Therefore, it doesn't make sense to call vmalloc to allocate just one page.

An example of a function in the kernel that uses vmalloc is the create_module system call, which uses vmalloc to get space for the module being created. Code and data of the module are later copied to the allocated space using copy_from_user. In this way, the module appears to be loaded into contiguous memory. You can verify, by looking in /proc/kallsyms, that kernel symbols exported by modules lie in a different memory range from symbols exported by the kernel proper.

Memory allocated with vmalloc is released by vfree, in the same way that kfree releases memory allocated by kmalloc.

Like vmalloc, ioremap builds new page tables; unlike vmalloc, however, it doesn't actually allocate any memory. The return value of ioremap is a special virtual address that can be used to access the specified physical address range; the virtual address obtained is eventually released by calling iounmap.

ioremap is most useful for mapping the (physical) address of a PCI buffer to (virtual) kernel space. For example, it can be used to access the frame buffer of a PCI video device; such buffers are usually mapped at high physical addresses, outside of the address range for which the kernel builds page tables at boot time. PCI issues are explained in more detail in Chapter 12.

It's worth noting that for the sake of portability, you should not directly access addresses returned by ioremap as if they were pointers to memory. Rather, you should always use readb and the other I/O functions introduced in Chapter 9. This requirement applies because some platforms, such as the Alpha, are unable to directly map PCI memory regions to the processor address space because of differences between PCI specs and Alpha processors in how data is transferred.

Both ioremap and vmalloc are page oriented (they work by modifying the page tables); consequently, the relocated or allocated size is rounded up to the nearest page boundary. ioremap simulates an unaligned mapping by "rounding down" the address to be remapped and by returning an offset into the first remapped page.

One minor drawback of vmalloc is that it can't be used in atomic context because, internally, it uses kmalloc(GFP_KERNEL) to acquire storage for the page tables, and therefore could sleep. This shouldn't be a problem—if the use of _ _get_free_page isn't good enough for an interrupt handler, the software design needs some cleaning up.

A scull Using Virtual Addresses: scullv

Sample code using vmalloc is provided in the scullv module. Like scullp, this module is a stripped-down version of scull that uses a different allocation function to obtain space for the device to store data.

The module allocates memory 16 pages at a time. The allocation is done in large chunks to achieve better performance than scullp and to show something that takes too long with other allocation techniques to be feasible. Allocating more than one page with _ _get_free_pages is failure prone, and even when it succeeds, it can be slow. As we saw earlier, vmalloc is faster than other functions in allocating several pages, but somewhat slower when retrieving a single page, because of the overhead of page-table building. scullv is designed like scullp. order specifies the "order" of each allocation and defaults to 4. The only difference between scullv and scullp is in allocation management. These lines use vmalloc to obtain new memory:

/* Allocate a quantum using virtual addresses */
if (!dptr->data[s_pos]) {
    dptr->data[s_pos] =
        (void *)vmalloc(PAGE_SIZE << dptr->order);
    if (!dptr->data[s_pos])
        goto nomem;
    memset(dptr->data[s_pos], 0, PAGE_SIZE << dptr->order);
}

and these lines release memory:

/* Release the quantum-set */
for (i = 0; i < qset; i++)
    if (dptr->data[i])
        vfree(dptr->data[i]);

If you compile both modules with debugging enabled, you can look at their data allocation by reading the files they create in /proc. This snapshot was taken on an x86_64 system:

salma% cat /tmp/bigfile > /dev/scullp0; head -5 /proc/scullpmem
Device 0: qset 500, order 0, sz 1535135
  item at 000001001847da58, qset at 000001001db4c000
       0:1001db56000
       1:1003d1c7000
   
salma% cat /tmp/bigfile > /dev/scullv0; head -5 /proc/scullvmem

Device 0: qset 500, order 4, sz 1535135
  item at 000001001847da58, qset at 0000010013dea000
       0:ffffff0001177000
       1:ffffff0001188000

The following output, instead, came from an x86 system:

rudo% cat /tmp/bigfile > /dev/scullp0; head -5 /proc/scullpmem

Device 0: qset 500, order 0, sz 1535135
  item at ccf80e00, qset at cf7b9800
       0:ccc58000
       1:cccdd000

rudo%  cat /tmp/bigfile > /dev/scullv0; head -5 /proc/scullvmem

Device 0: qset 500, order 4, sz 1535135
  item at cfab4800, qset at cf8e4000
       0:d087a000
       1:d08d2000

The values show two different behaviors. On x86_64, physical addresses and virtual addresses are mapped to completely different address ranges (0x100 and 0xffffff00), while on x86 computers, vmalloc returns virtual addresses just above the mapping used for physical memory.

Per-CPU Variables

Per-CPU variables are an interesting 2.6 kernel feature. When you create a per-CPU variable, each processor on the system gets its own copy of that variable. This may seem like a strange thing to want to do, but it has its advantages. Access to per-CPU variables requires (almost) no locking, because each processor works with its own copy. Per-CPU variables can also remain in their respective processors' caches, which leads to significantly better performance for frequently updated quantities.

A good example of per-CPU variable use can be found in the networking subsystem. The kernel maintains no end of counters tracking how many of each type of packet was received; these counters can be updated thousands of times per second. Rather than deal with the caching and locking issues, the networking developers put the statistics counters into per-CPU variables. Updates are now lockless and fast. On the rare occasion that user space requests to see the values of the counters, it is a simple matter to add up each processor's version and return the total.

The declarations for per-CPU variables can be found in <linux/percpu.h>. To create a per-CPU variable at compile time, use this macro:

DEFINE_PER_CPU(type, name);

If the variable (to be called name) is an array, include the dimension information with the type. Thus, a per-CPU array of three integers would be created with:

DEFINE_PER_CPU(int[3], my_percpu_array);

Per-CPU variables can be manipulated without explicit locking—almost. Remember that the 2.6 kernel is preemptible; it would not do for a processor to be preempted in the middle of a critical section that modifies a per-CPU variable. It also would not be good if your process were to be moved to another processor in the middle of a per-CPU variable access. For this reason, you must explicitly use the get_cpu_var macro to access the current processor's copy of a given variable, and call put_cpu_var when you are done. The call to get_cpu_var returns an lvalue for the current processor's version of the variable and disables preemption. Since an lvalue is returned, it can be assigned to or operated on directly. For example, one counter in the networking code is incremented with these two statements:

get_cpu_var(sockets_in_use)++;
put_cpu_var(sockets_in_use);

You can access another processor's copy of the variable with:

per_cpu(variable, int cpu_id);

If you write code that involves processors reaching into each other's per-CPU variables, you, of course, have to implement a locking scheme that makes that access safe.

Dynamically allocated per-CPU variables are also possible. These variables can be allocated with:

void *alloc_percpu(type);
void *_ _alloc_percpu(size_t size, size_t align);

In most cases, alloc_percpu does the job; you can call _ _alloc_percpu in cases where a particular alignment is required. In either case, a per-CPU variable can be returned to the system with free_percpu. Access to a dynamically allocated per-CPU variable is done via per_cpu_ptr:

per_cpu_ptr(void *per_cpu_var, int cpu_id);

This macro returns a pointer to the version of per_cpu_var corresponding to the given cpu_id. If you are simply reading another CPU's version of the variable, you can dereference that pointer and be done with it. If, however, you are manipulating the current processor's version, you probably need to ensure that you cannot be moved out of that processor first. If the entirety of your access to the per-CPU variable happens with a spinlock held, all is well. Usually, however, you need to use get_cpu to block preemption while working with the variable. Thus, code using dynamic per-CPU variables tends to look like this:

int cpu;

cpu = get_cpu();
ptr = per_cpu_ptr(per_cpu_var, cpu);
/* work with ptr */
put_cpu();

When using compile-time per-CPU variables, the get_cpu_var and put_cpu_var macros take care of these details. Dynamic per-CPU variables require more explicit protection.

Per-CPU variables can be exported to modules, but you must use a special version of the macros:

EXPORT_PER_CPU_SYMBOL(per_cpu_var);
EXPORT_PER_CPU_SYMBOL_GPL(per_cpu_var);

To access such a variable within a module, declare it with:

DECLARE_PER_CPU(type, name);

The use of DECLARE_PER_CPU (instead of DEFINE_PER_CPU) tells the compiler that an external reference is being made.

If you want to use per-CPU variables to create a simple integer counter, take a look at the canned implementation in <linux/percpu_counter.h>. Finally, note that some architectures have a limited amount of address space available for per-CPU variables. If you create per-CPU variables in your code, you should try to keep them small.

Obtaining Large Buffers

As we have noted in previous sections, allocations of large, contiguous memory buffers are prone to failure. System memory fragments over time, and chances are that a truly large region of memory will simply not be available. Since there are usually ways of getting the job done without huge buffers, the kernel developers have not put a high priority on making large allocations work. Before you try to obtain a large memory area, you should really consider the alternatives. By far the best way of performing large I/O operations is through scatter/gather operations, which we discuss in Chapter 15.

Acquiring a Dedicated Buffer at Boot Time

If you really need a huge buffer of physically contiguous memory, the best approach is often to allocate it by requesting memory at boot time. Allocation at boot time is the only way to retrieve consecutive memory pages while bypassing the limits imposed by _ _get_free_pages on the buffer size, both in terms of maximum allowed size and limited choice of sizes. Allocating memory at boot time is a "dirty" technique, because it bypasses all memory management policies by reserving a private memory pool. This technique is inelegant and inflexible, but it is also the least prone to failure. Needless to say, a module can't allocate memory at boot time; only drivers directly linked to the kernel can do that.

One noticeable problem with boot-time allocation is that it is not a feasible option for the average user, since this mechanism is available only for code linked in the kernel image. A device driver using this kind of allocation can be installed or replaced only by rebuilding the kernel and rebooting the computer.

When the kernel is booted, it gains access to all the physical memory available in the system. It then initializes each of its subsystems by calling that subsystem's initialization function, allowing initialization code to allocate a memory buffer for private use by reducing the amount of RAM left for normal system operation.

Boot-time memory allocation is performed by calling one of these functions:

#include <linux/bootmem.h>
void *alloc_bootmem(unsigned long size);
void *alloc_bootmem_low(unsigned long size);
void *alloc_bootmem_pages(unsigned long size);
void *alloc_bootmem_low_pages(unsigned long size);

The functions allocate either whole pages (if they end with _pages) or non-page-aligned memory areas. The allocated memory may be high memory unless one of the _low versions is used. If you are allocating this buffer for a device driver, you probably want to use it for DMA operations, and that is not always possible with high memory; thus, you probably want to use one of the _low variants.

It is rare to free memory allocated at boot time; you will almost certainly be unable to get it back later if you want it. There is an interface to free this memory, however:

void free_bootmem(unsigned long addr, unsigned long size);

Note that partial pages freed in this manner are not returned to the system—but, if you are using this technique, you have probably allocated a fair number of whole pages to begin with.

If you must use boot-time allocation, you need to link your driver directly into the kernel. See the files in the kernel source under Documentation/kbuild for more information on how this should be done.

Quick Reference

The functions and symbols related to memory allocation are:

#include <linux/slab.h>

void *kmalloc(size_t size, int flags);

void kfree(void *obj);

最常用的内存分配接口。

The most frequently used interface to memory allocation.

#include <linux/mm.h>

GFP_USER

GFP_KERNEL

GFP_NOFS

GFP_NOIO

GFP_ATOMIC
#include <linux/mm.h>

GFP_USER

GFP_KERNEL

GFP_NOFS

GFP_NOIO

GFP_ATOMIC

控制内存分配执行方式的标志,按从限制最少到限制最多的顺序排列。GFP_USER 和 GFP_KERNEL 优先级允许当前进程进入睡眠状态以满足请求。GFP_NOFS 和 GFP_NOIO 分别禁用文件系统操作和所有 I/O 操作,而 GFP_ATOMIC 分配根本无法休眠。

Flags that control how memory allocations are performed, from the least restrictive to the most. The GFP_USER and GFP_KERNEL priorities allow the current process to be put to sleep to satisfy the request. GFP_NOFS and GFP_NOIO disable filesystem operations and all I/O operations, respectively, while GFP_ATOMIC allocations cannot sleep at all.

_ _GFP_DMA

_ _GFP_HIGHMEM

_ _GFP_COLD

_ _GFP_NOWARN

_ _GFP_HIGH

_ _GFP_REPEAT

_ _GFP_NOFAIL

_ _GFP_NORETRY
_ _GFP_DMA

_ _GFP_HIGHMEM

_ _GFP_COLD

_ _GFP_NOWARN

_ _GFP_HIGH

_ _GFP_REPEAT

_ _GFP_NOFAIL

_ _GFP_NORETRY

这些标志修改内核在分配内存时的行为。

These flags modify the kernel's behavior when allocating memory.

#include <linux/slab.h>

kmem_cache_t *kmem_cache_create(char *name, size_t size, size_t offset,

unsigned long flags, constructor( ), destructor( ));

int kmem_cache_destroy(kmem_cache_t *cache);
#include <linux/slab.h>

kmem_cache_t *kmem_cache_create(char *name, size_t size, size_t offset,

unsigned long flags, constructor( ), destructor( ));

int kmem_cache_destroy(kmem_cache_t *cache);

创建和销毁slab缓存。缓存可用于分配多个相同大小的对象。

Create and destroy a slab cache. The cache can be used to allocate several objects of the same size.

SLAB_NO_REAP

SLAB_HWCACHE_ALIGN

SLAB_CACHE_DMA
SLAB_NO_REAP

SLAB_HWCACHE_ALIGN

SLAB_CACHE_DMA

创建缓存时可以指定的标志。

Flags that can be specified while creating a cache.

SLAB_CTOR_ATOMIC

SLAB_CTOR_CONSTRUCTOR
SLAB_CTOR_ATOMIC

SLAB_CTOR_CONSTRUCTOR

分配器可以传递给构造函数和析构函数的标志。

Flags that the allocator can pass to the constructor and the destructor functions.

void *kmem_cache_alloc(kmem_cache_t *cache, int flags);

void kmem_cache_free(kmem_cache_t *cache, const void *obj);
void *kmem_cache_alloc(kmem_cache_t *cache, int flags);

void kmem_cache_free(kmem_cache_t *cache, const void *obj);

从缓存中分配和释放单个对象。

Allocate and release a single object from the cache.

/proc/slabinfo
/proc/slabinfo

包含slab缓存使用统计信息的虚拟文件。

A virtual file containing statistics on slab cache usage.

#include <linux/mempool.h>

mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn, mempool_free_t

*free_fn, void *data);

void mempool_destroy(mempool_t *pool);
#include <linux/mempool.h>

mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn, mempool_free_t

*free_fn, void *data);

void mempool_destroy(mempool_t *pool);

用于创建内存池的函数;内存池试图通过保留一份已分配项的"紧急列表"来避免内存分配失败。

Functions for the creation of memory pools, which try to avoid memory allocation failures by keeping an "emergency list" of allocated items.

void *mempool_alloc(mempool_t *pool, int gfp_mask);

void mempool_free(void *element, mempool_t *pool);
void *mempool_alloc(mempool_t *pool, int gfp_mask);

void mempool_free(void *element, mempool_t *pool);

用于从内存池分配项目(并将其返回到内存池)的函数。

Functions for allocating items from (and returning them to) memory pools.

unsigned long get_zeroed_page(int flags);

unsigned long _ _get_free_page(int flags);

unsigned long _ _get_free_pages(int flags, unsigned long order);
unsigned long get_zeroed_page(int flags);

unsigned long _ _get_free_page(int flags);

unsigned long _ _get_free_pages(int flags, unsigned long order);

面向页的分配函数。get_zeroed_page 返回单个零填充页面。该调用的所有其他版本不会初始化返回页面的内容。

The page-oriented allocation functions. get_zeroed_page returns a single, zero-filled page. All the other versions of the call do not initialize the contents of the returned page(s).

int get_order(unsigned long size);
int get_order(unsigned long size);

根据 PAGE_SIZE,返回与当前平台上 size 相关联的分配阶数。参数必须是 2 的幂,并且返回值至少为 0。

Returns the allocation order associated to size in the current platform, according to PAGE_SIZE. The argument must be a power of two, and the return value is at least 0.
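The order computation can be sketched in user space. The reimplementation below is purely illustrative (my_get_order and the 4096-byte MY_PAGE_SIZE are assumptions, not the kernel's definitions): it finds the smallest order such that a block of `PAGE_SIZE << order` bytes covers the request.

```c
#include <assert.h>

/* Assumed page size for illustration; the real PAGE_SIZE is
 * architecture dependent (4096 bytes on x86). */
#define MY_PAGE_SIZE 4096UL

/* Hypothetical user-space sketch of the order computation: the smallest
 * order such that (MY_PAGE_SIZE << order) >= size. */
static int my_get_order(unsigned long size)
{
    int order = 0;
    unsigned long chunk = MY_PAGE_SIZE;

    while (chunk < size) {
        chunk <<= 1;
        order++;
    }
    return order;
}
```

With this sketch, a request of 8000 bytes maps to order 1 (two pages) and 16384 bytes to order 2.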

void free_page(unsigned long addr);

void free_pages(unsigned long addr, unsigned long order);
void free_page(unsigned long addr);

void free_pages(unsigned long addr, unsigned long order);

释放面向页面的分配的函数。

Functions that release page-oriented allocations.

struct page *alloc_pages_node(int nid, unsigned int flags, unsigned int order);

struct page *alloc_pages(unsigned int flags, unsigned int order);

struct page *alloc_page(unsigned int flags);
struct page *alloc_pages_node(int nid, unsigned int flags, unsigned int order);

struct page *alloc_pages(unsigned int flags, unsigned int order);

struct page *alloc_page(unsigned int flags);

Linux 内核中最低级页面分配器的所有变体。

All variants of the lowest-level page allocator in the Linux kernel.

void _ _free_page(struct page *page);

void _ _free_pages(struct page *page, unsigned int order);

void free_hot_page(struct page *page);

void free_cold_page(struct page *page);
void _ _free_page(struct page *page);

void _ _free_pages(struct page *page, unsigned int order);

void free_hot_page(struct page *page);

void free_cold_page(struct page *page);

释放使用 alloc_page 的某种形式分配的页面的各种方法。

Various ways of freeing pages allocated with one of the forms of alloc_page.

#include <linux/vmalloc.h>

void * vmalloc(unsigned long size);

void vfree(void * addr);

#include <asm/io.h>

void * ioremap(unsigned long offset, unsigned long size);

void iounmap(void *addr);
#include <linux/vmalloc.h>

void * vmalloc(unsigned long size);

void vfree(void * addr);

#include <asm/io.h>

void * ioremap(unsigned long offset, unsigned long size);

void iounmap(void *addr);

分配或释放连续虚拟地址空间的函数。ioremap 通过虚拟地址访问物理内存,而 vmalloc 则分配空闲页。使用 ioremap 映射的区域用 iounmap 释放,而从 vmalloc 获得的页面用 vfree 释放。

Functions that allocate or free a contiguous virtual address space. ioremap accesses physical memory through virtual addresses, while vmalloc allocates free pages. Regions mapped with ioremap are freed with iounmap, while pages obtained from vmalloc are released with vfree.

#include <linux/percpu.h>

DEFINE_PER_CPU(type, name);

DECLARE_PER_CPU(type, name);
#include <linux/percpu.h>

DEFINE_PER_CPU(type, name);

DECLARE_PER_CPU(type, name);

定义和声明每个 CPU 变量的宏。

Macros that define and declare per-CPU variables.

per_cpu(variable, int cpu_id)

get_cpu_var(variable)

put_cpu_var(variable)
per_cpu(variable, int cpu_id)

get_cpu_var(variable)

put_cpu_var(variable)

提供对静态声明的每 CPU 变量的访问的宏。

Macros that provide access to statically declared per-CPU variables.

void *alloc_percpu(type);

void *_ _alloc_percpu(size_t size, size_t align);

void free_percpu(void *variable);
void *alloc_percpu(type);

void *_ _alloc_percpu(size_t size, size_t align);

void free_percpu(void *variable);

执行运行时分配和释放每 CPU 变量的函数。

Functions that perform runtime allocation and freeing of per-CPU variables.

int get_cpu( );

void put_cpu( );

per_cpu_ptr(void *variable, int cpu_id)
int get_cpu( );

void put_cpu( );

per_cpu_ptr(void *variable, int cpu_id)

get_cpu获取对当前处理器的引用(因此,防止抢占和移动到另一个处理器)并返回处理器的 ID 号;put_cpu返回该引用。要访问动态分配的每 CPU 变量,请使用per_cpu_ptr 以及应访问其版本的 CPU ID。对当前 CPU 版本的动态、每 CPU 变量的操作可能应该包含在对 get_cpuput_cpu 的调用中。

get_cpu obtains a reference to the current processor (therefore, preventing preemption and movement to another processor) and returns the ID number of the processor; put_cpu returns that reference. To access a dynamically allocated per-CPU variable, use per_cpu_ptr with the ID of the CPU whose version should be accessed. Manipulations of the current CPU's version of a dynamic, per-CPU variable should probably be surrounded by calls to get_cpu and put_cpu.

#include <linux/bootmem.h>

void *alloc_bootmem(unsigned long size);

void *alloc_bootmem_low(unsigned long size);

void *alloc_bootmem_pages(unsigned long size);

void *alloc_bootmem_low_pages(unsigned long size);

void free_bootmem(unsigned long addr, unsigned long size);
#include <linux/bootmem.h>

void *alloc_bootmem(unsigned long size);

void *alloc_bootmem_low(unsigned long size);

void *alloc_bootmem_pages(unsigned long size);

void *alloc_bootmem_low_pages(unsigned long size);

void free_bootmem(unsigned long addr, unsigned long size);

在系统引导时执行内存分配和释放的函数(只能由直接链接到内核的驱动程序使用)。

Functions (which can be used only by drivers directly linked into the kernel) that perform allocation and freeing of memory at system bootstrap time.




[ 1 ]除其他事项外,这意味着您应该显式清除任何可能暴露给用户空间或写入设备的内存;否则,您可能会泄露应当保密的信息。

[1] Among other things, this implies that you should explicitly clear any memory that might be exposed to user space or written to a device; otherwise, you risk disclosing information that should be kept private.

[ 2 ]虽然 alloc_pages(稍后描述)确实应该用于分配高端内存页面,但其中的原因我们要到第 15 章才能真正深入讨论。

[2] Although alloc_pages (described shortly) should really be used for allocating high-memory pages, for reasons we can't really get into until Chapter 15.

[ 3 ] NUMA(非均匀内存访问)计算机是多处理器系统,其中内存对于特定处理器组(“节点”)来说是“本地的”。访问本地内存比访问非本地内存更快。在此类系统上,在正确的节点上分配内存非常重要。不过,驱动程序作者通常不必担心 NUMA 问题。

[3] NUMA (nonuniform memory access) computers are multiprocessor systems where memory is "local" to specific groups of processors ("nodes"). Access to local memory is faster than access to nonlocal memory. On such systems, allocating memory on the correct node is important. Driver authors do not normally have to worry about NUMA issues, however.

[ 4 ]实际上,一些体系结构将“虚拟”地址范围定义为保留用于寻址物理内存。当发生这种情况时,Linux 内核会利用该功能,并且内核和_ _get_free_pages地址都位于这些内存范围之一中。这种差异对于设备驱动程序和其他不直接涉及内存管理内核子系统的代码来说是透明的。

[4] Actually, some architectures define ranges of "virtual" addresses as reserved to address physical memory. When this happens, the Linux kernel takes advantage of the feature, and both the kernel and _ _get_free_pages addresses lie in one of those memory ranges. The difference is transparent to device drivers and other code that is not directly involved with the memory-management kernel subsystem.

第 9 章与硬件通信

Chapter 9. Communicating with Hardware

尽管玩 scull 和类似的玩具可以很好地介绍 Linux 设备驱动程序的软件接口,但实现真实的设备需要硬件。驱动程序是软件概念和硬件电路之间的抽象层;因此,它需要与双方对话。到目前为止,我们已经研究了软件概念的内部结构;本章通过向您展示驱动程序如何在跨 Linux 平台保持可移植的同时访问 I/O 端口和 I/O 内存,来完成整个图景。

Although playing with scull and similar toys is a good introduction to the software interface of a Linux device driver, implementing a real device requires hardware. The driver is the abstraction layer between software concepts and hardware circuitry; as such, it needs to talk with both of them. Up until now, we have examined the internals of software concepts; this chapter completes the picture by showing you how a driver can access I/O ports and I/O memory while being portable across Linux platforms.

本章延续了尽可能独立于特定硬件的传统。然而,在需要具体示例的地方,我们使用简单的数字 I/O 端口(例如标准 PC 并行端口)来展示 I/O 指令如何工作,并使用普通的帧缓冲视频内存来展示内存映射的 I/O。

This chapter continues in the tradition of staying as independent of specific hardware as possible. However, where specific examples are needed, we use simple digital I/O ports (such as the standard PC parallel port) to show how the I/O instructions work and normal frame-buffer video memory to show memory-mapped I/O.

我们选择简单的数字 I/O,因为它是最简单的输入/输出端口形式。此外,并行端口实现原始 I/O,并且在大多数计算机中可用:写入设备的数据位出现在输出引脚上,并且输入引脚上的电压电平可由处理器直接访问。实际上,您必须将 LED 或打印机连接到端口才能实际看到数字 I/O 操作的结果,但底层硬件非常易于使用。

We chose simple digital I/O because it is the easiest form of an input/output port. Also, the parallel port implements raw I/O and is available in most computers: data bits written to the device appear on the output pins, and voltage levels on the input pins are directly accessible by the processor. In practice, you have to connect LEDs or a printer to the port to actually see the results of a digital I/O operation, but the underlying hardware is extremely easy to use.

I/O 端口和 I/O 内存

I/O Ports and I/O Memory

每个外围设备都是通过写入和读取其寄存器来控制的。大多数时候,设备有多个寄存器,它们在连续的地址上被访问,或者在内存地址空间中,或者在 I/O 地址空间中。

Every peripheral device is controlled by writing and reading its registers. Most of the time a device has several registers, and they are accessed at consecutive addresses, either in the memory address space or in the I/O address space.

在硬件层面,内存区域和 I/O 区域之间没有概念上的区别:两者都是通过在地址总线和控制总线上断言电信号(即读和写信号)[1],并通过从数据总线读取或向数据总线写入来访问的。

At the hardware level, there is no conceptual difference between memory regions and I/O regions: both of them are accessed by asserting electrical signals on the address bus and control bus (i.e., the read and write signals)[1] and by reading from or writing to the data bus.

虽然一些 CPU 制造商在其芯片中实现了单一地址空间,但其他制造商认为外围设备与内存不同,因此应该有一个单独的地址空间。某些处理器(最著名的是 x86 系列)为 I/O 端口提供了单独的读写电信号线,以及用于访问端口的特殊 CPU 指令。

While some CPU manufacturers implement a single address space in their chips, others decided that peripheral devices are different from memory and, therefore, deserve a separate address space. Some processors (most notably the x86 family) have separate read and write electrical lines for I/O ports and special CPU instructions to access ports.

由于外围设备是为适应外围总线而构建的,而最流行的 I/O 总线是以个人计算机为原型的,因此即使是没有单独 I/O 端口地址空间的处理器,在访问某些外围设备时也必须伪造 I/O 端口的读写,这通常借助外部芯片组或 CPU 内核中的额外电路来实现。后一种解决方案在面向嵌入式用途的微型处理器中很常见。

Because peripheral devices are built to fit a peripheral bus, and the most popular I/O buses are modeled on the personal computer, even processors that do not have a separate address space for I/O ports must fake reading and writing I/O ports when accessing some peripheral devices, usually by means of external chipsets or extra circuitry in the CPU core. The latter solution is common within tiny processors meant for embedded use.

出于同样的原因,Linux 在其运行的所有计算机平台上都实现了 I/O 端口的概念,甚至在 CPU 实现单个地址空间的平台上也是如此。端口访问的实现有时取决于主机的具体品牌和型号(因为不同型号使用不同的芯片组将总线事务映射到内存地址空间)。

For the same reason, Linux implements the concept of I/O ports on all computer platforms it runs on, even on platforms where the CPU implements a single address space. The implementation of port access sometimes depends on the specific make and model of the host computer (because different models use different chipsets to map bus transactions into memory address space).

即使外设总线具有用于 I/O 端口的单独地址空间,并非所有设备都将其寄存器映射到 I/O 端口。虽然 I/O 端口的使用对于 ISA 外围板来说很常见,但大多数 PCI 设备将寄存器映射到内存地址区域。这种 I/O 内存方法通常是首选,因为它不需要使用专用处理器指令;CPU 内核访问内存的效率要高得多,并且编译器在访问内存时在寄存器分配和寻址模式选择方面有更大的自由度。

Even if the peripheral bus has a separate address space for I/O ports, not all devices map their registers to I/O ports. While use of I/O ports is common for ISA peripheral boards, most PCI devices map registers into a memory address region. This I/O memory approach is generally preferred, because it doesn't require the use of special-purpose processor instructions; CPU cores access memory much more efficiently, and the compiler has much more freedom in register allocation and addressing-mode selection when accessing memory.

I/O 寄存器和传统存储器

I/O Registers and Conventional Memory

尽管硬件寄存器和内存之间有很强的相似性,但访问 I/O 寄存器的程序员必须小心,避免被可能改变预期 I/O 行为的 CPU(或编译器)优化所欺骗。

Despite the strong similarity between hardware registers and memory, a programmer accessing I/O registers must be careful to avoid being tricked by CPU (or compiler) optimizations that can modify the expected I/O behavior.

I/O 寄存器和 RAM 之间的主要区别在于 I/O 操作有副作用,而内存操作则没有:内存写入的唯一作用是将一个值存储到某个位置,而内存读取则返回最后写入该位置的值。由于内存访问速度对 CPU 性能至关重要,这种无副作用的情况已通过多种方式进行了优化:缓存值以及重新排序读/写指令。

The main difference between I/O registers and RAM is that I/O operations have side effects, while memory operations have none: the only effect of a memory write is storing a value to a location, and a memory read returns the last value written there. Because memory access speed is so critical to CPU performance, the no-side-effects case has been optimized in several ways: values are cached and read/write instructions are reordered.

编译器可以将数据值缓存在 CPU 寄存器中而不写入内存,即使将它们存储了,写和读操作也可能只作用于高速缓存而从未到达物理 RAM。重新排序也可能发生在编译器级别和硬件级别:如果指令序列以不同于程序文本中出现的顺序运行,通常可以执行得更快,例如,为了防止 RISC 流水线中的互锁。在 CISC 处理器上,耗时较长的操作可以与其他更快的操作并发执行。

The compiler can cache data values into CPU registers without writing them to memory, and even if it stores them, both write and read operations can operate on cache memory without ever reaching physical RAM. Reordering can also happen both at the compiler level and at the hardware level: often a sequence of instructions can be executed more quickly if it is run in an order different from that which appears in the program text, for example, to prevent interlocks in the RISC pipeline. On CISC processors, operations that take a significant amount of time can be executed concurrently with other, quicker ones.

当应用于传统内存时(至少在单处理器系统上),这些优化是透明且良性的,但它们对于正确的 I/O 操作可能是致命的,因为它们会干扰那些“副作用”,而这些“副作用”是驱动程序访问 I/O 的主要原因。 /O 寄存器。处理器无法预见某些其他进程(在单独的处理器上运行,或 I/O 控制器内部发生的事情)依赖于内存访问顺序的情况。编译器或 CPU 可能只是试图智取您并重新排序您请求的操作;结果可能是非常难以调试的奇怪错误。因此,驱动程序必须确保在访问寄存器时不执行缓存并且不发生读或写重新排序。

These optimizations are transparent and benign when applied to conventional memory (at least on uniprocessor systems), but they can be fatal to correct I/O operations, because they interfere with those "side effects" that are the main reason why a driver accesses I/O registers. The processor cannot anticipate a situation in which some other process (running on a separate processor, or something happening inside an I/O controller) depends on the order of memory access. The compiler or the CPU may just try to outsmart you and reorder the operations you request; the result can be strange errors that are very difficult to debug. Therefore, a driver must ensure that no caching is performed and no read or write reordering takes place when accessing registers.

硬件缓存的问题是最容易面临的:底层硬件已经配置(自动或通过 Linux 初始化代码)在访问 I/O 区域(无论是内存还是端口区域)时禁用任何硬件缓存。

The problem with hardware caching is the easiest to face: the underlying hardware is already configured (either automatically or by Linux initialization code) to disable any hardware cache when accessing I/O regions (whether they are memory or port regions).

编译器优化和硬件重新排序的解决方案是在必须按特定顺序对硬件(或另一个处理器)可见的操作之间放置内存屏障。Linux 提供了四个宏来满足所有可能的排序需求:

The solution to compiler optimization and hardware reordering is to place a memory barrier between operations that must be visible to the hardware (or to another processor) in a particular order. Linux provides four macros to cover all possible ordering needs:

#include <linux/kernel.h>

void barrier(void)
#include <linux/kernel.h>

void barrier(void)

该函数告诉编译器插入内存屏障,但对硬件没有影响。编译后的代码将当前已修改且驻留在 CPU 寄存器中的所有值存储到内存中,并在以后需要时重新读取它们。对 barrier 的调用可以防止跨越屏障的编译器优化,但让硬件自由地进行自己的重新排序。

This function tells the compiler to insert a memory barrier but has no effect on the hardware. Compiled code stores to memory all values that are currently modified and resident in CPU registers, and rereads them later when they are needed. A call to barrier prevents compiler optimizations across the barrier but leaves the hardware free to do its own reordering.

#include <asm/system.h>

void rmb(void);

void read_barrier_depends(void);

void wmb(void);

void mb(void);
#include <asm/system.h>

void rmb(void);

void read_barrier_depends(void);

void wmb(void);

void mb(void);

这些函数在编译后的指令流中插入硬件内存屏障;它们的实际实现取决于平台。rmb(读内存屏障)保证屏障之前出现的任何读取都在执行任何后续读取之前完成。wmb 保证写操作的顺序,而 mb 指令则同时保证两者。这些函数中的每一个都是 barrier 的超集。

These functions insert hardware memory barriers in the compiled instruction flow; their actual instantiation is platform dependent. An rmb (read memory barrier) guarantees that any reads appearing before the barrier are completed prior to the execution of any subsequent read. wmb guarantees ordering in write operations, and the mb instruction guarantees both. Each of these functions is a superset of barrier.

read_barrier_depends 是一种特殊的、较弱的读屏障形式。rmb 阻止跨越屏障的所有读取的重新排序,而 read_barrier_depends 仅阻止依赖于其他读取所返回数据的读取的重新排序。这种区别很微妙,并且并非在所有架构上都存在。除非您确切了解发生了什么,并且有理由相信完整的读屏障会带来过高的性能成本,否则您可能应该坚持使用 rmb。

read_barrier_depends is a special, weaker form of read barrier. Whereas rmb prevents the reordering of all reads across the barrier, read_barrier_depends blocks only the reordering of reads that depend on data from other reads. The distinction is subtle, and it does not exist on all architectures. Unless you understand exactly what is going on, and you have a reason to believe that a full read barrier is exacting an excessive performance cost, you should probably stick to using rmb.

void smp_rmb(void);

void smp_read_barrier_depends(void);

void smp_wmb(void);

void smp_mb(void);
void smp_rmb(void);

void smp_read_barrier_depends(void);

void smp_wmb(void);

void smp_mb(void);

这些版本的屏障宏仅在为 SMP 系统编译内核时才插入硬件屏障;否则,它们都扩展为一个简单的 barrier 调用。

These versions of the barrier macros insert hardware barriers only when the kernel is compiled for SMP systems; otherwise, they all expand to a simple barrier call.

设备驱动程序中内存屏障的典型用法可能有这种形式:

A typical usage of memory barriers in a device driver may have this sort of form:

writel(dev->registers.addr, io_destination_address);
writel(dev->registers.size, io_size);
writel(dev->registers.operation, DEV_READ);
wmb();
writel(dev->registers.control, DEV_GO);
writel(dev->registers.addr, io_destination_address);
writel(dev->registers.size, io_size);
writel(dev->registers.operation, DEV_READ);
wmb(  );
writel(dev->registers.control, DEV_GO);

在这种情况下,重要的是要确保在告诉它开始之前控制特定操作的所有设备寄存器都已正确设置。内存屏障强制按必要的顺序完成写入。

In this case, it is important to be sure that all of the device registers controlling a particular operation have been properly set prior to telling it to begin. The memory barrier enforces the completion of the writes in the necessary order.

由于内存屏障会影响性能,因此应仅在真正需要的地方使用它们。不同类型的屏障也可能具有不同的性能特征,因此值得使用尽可能具体的类型。例如,在x86架构上,wmb()当前不执行任何操作,因为处理器外部的写入不会被重新排序。然而,读取会被重新排序,因此 mb()比wmb()慢。

Because memory barriers affect performance, they should be used only where they are really needed. The different types of barriers can also have different performance characteristics, so it is worthwhile to use the most specific type possible. For example, on the x86 architecture, wmb( ) currently does nothing, since writes outside the processor are not reordered. Reads are reordered, however, so mb( ) is slower than wmb( ).

值得注意的是,大多数其他处理同步的内核原语(例如自旋锁和atomic_t 操作)也充当内存屏障。另外值得注意的是,一些外设总线(例如 PCI 总线)有其自身的缓存问题;我们将在后面的章节中讨论这些内容。

It is worth noting that most of the other kernel primitives dealing with synchronization, such as spinlock and atomic_t operations, also function as memory barriers. Also worthy of note is that some peripheral buses (such as the PCI bus) have caching issues of their own; we discuss those when we get to them in later chapters.

一些架构允许赋值和内存屏障的有效组合。内核提供了一些执行这种组合的宏;在默认情况下,它们定义如下:

Some architectures allow the efficient combination of an assignment and a memory barrier. The kernel provides a few macros that perform this combination; in the default case, they are defined as follows:

#define set_mb(var, value)  do {var = value; mb(  );}  while (0)
#define set_wmb(var, value) do {var = value; wmb(  );} while (0)
#define set_rmb(var, value) do {var = value; rmb(  );} while (0)
#define set_mb(var, value)  do {var = value; mb(  );}  while (0)
#define set_wmb(var, value) do {var = value; wmb(  );} while (0)
#define set_rmb(var, value) do {var = value; rmb(  );} while (0)

在适当的情况下,<asm/system.h> 将这些宏定义为使用特定于体系结构的指令,以便更快地完成任务。请注意,set_rmb 仅由少数架构定义。(do...while 构造的使用是标准 C 习惯用法,它使展开后的宏在所有上下文中都像普通 C 语句一样工作。)

Where appropriate, <asm/system.h> defines these macros to use architecture-specific instructions that accomplish the task more quickly. Note that set_rmb is defined only by a small number of architectures. (The use of a do...while construct is a standard C idiom that causes the expanded macro to work as a normal C statement in all contexts.)
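The point of the do...while wrapper can be seen with a small user-space example; set_pair below is a made-up macro for illustration, not a kernel interface. Without the wrapper, a two-statement macro would break when used as the body of an if that has an else.

```c
#include <assert.h>

/* A hypothetical two-statement macro. The do { ... } while (0) wrapper
 * turns it into a single C statement, so the trailing semicolon and the
 * if/else below parse exactly as they would for a function call. */
#define set_pair(a, b, value) do { (a) = (value); (b) = (value); } while (0)

static int demo(int cond)
{
    int x = 0, y = 0;

    if (cond)
        set_pair(x, y, 7);  /* would not compile if the macro were two bare statements */
    else
        set_pair(x, y, 9);
    return x + y;
}
```

Had the macro expanded to two bare statements, only the first would fall under the if, and the stray semicolon before else would be a syntax error.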

使用 I/O 端口

Using I/O Ports

I/O 端口是驱动程序与许多设备(至少在部分时间)进行通信的方式。本节介绍可用于使用 I/O 端口的各种函数;我们还会涉及一些可移植性问题。

I/O ports are the means by which drivers communicate with many devices, at least part of the time. This section covers the various functions available for making use of I/O ports; we also touch on some portability issues.

I/O端口分配

I/O Port Allocation

正如您所料,在没有首先确保对这些端口拥有独占访问权之前,您不应贸然对 I/O 端口进行操作。内核提供了一个注册接口,允许您的驱动程序声明它所需要的端口。该接口中的核心函数是 request_region:

As you might expect, you should not go off and start pounding on I/O ports without first ensuring that you have exclusive access to those ports. The kernel provides a registration interface that allows your driver to claim the ports it needs. The core function in that interface is request_region:

#include <linux/ioport.h>
struct resource *request_region(unsigned long first, unsigned long n, 
                                const char *name);
#include <linux/ioport.h>
struct resource *request_region(unsigned long first, unsigned long n, 
                                const char *name);

该函数告诉内核您想要使用从 first 开始的 n 个端口。name 参数应该是您的设备的名称。如果分配成功,返回值为非 NULL。如果 request_region 返回 NULL,您将无法使用所需的端口。

This function tells the kernel that you would like to make use of n ports, starting with first. The name parameter should be the name of your device. The return value is non-NULL if the allocation succeeds. If you get NULL back from request_region, you will not be able to use the desired ports.

所有端口分配都显示在/proc/ioports中。如果您无法分配所需的一组端口,则可以在此处查看谁先到达那里。

All port allocations show up in /proc/ioports. If you are unable to allocate a needed set of ports, that is the place to look to see who got there first.

当您完成一组 I/O 端口(可能在模块卸载时)时,应使用以下命令将它们返回到系统:

When you are done with a set of I/O ports (at module unload time, perhaps), they should be returned to the system with:

void release_region(unsigned long start, unsigned long n);
void release_region(unsigned long start, unsigned long n);

还有一个函数允许您的驱动程序检查一组给定的 I/O 端口是否可用:

There is also a function that allows your driver to check to see whether a given set of I/O ports is available:

int check_region(unsigned long first, unsigned long n);
int check_region(unsigned long first, unsigned long n);

此处,如果给定端口不可用,则返回值为负错误代码。该函数已被弃用,因为它的返回值不能保证分配是否成功;检查和稍后分配不是原子操作。我们在这里列出它是因为有几个驱动程序仍在使用它,但您应该始终使用 request_region,它执行所需的锁定以确保分配以安全、原子的方式完成。

Here, the return value is a negative error code if the given ports are not available. This function is deprecated because its return value provides no guarantee of whether an allocation would succeed; checking and later allocating are not an atomic operation. We list it here because several drivers are still using it, but you should always use request_region, which performs the required locking to ensure that the allocation is done in a safe, atomic manner.
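The claim-before-use discipline can be modeled in plain user-space C. The functions below are hypothetical stand-ins for request_region/release_region over a toy port bitmap; they only demonstrate the pattern (atomically refuse overlapping claims, release when done), not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of port allocation: a bitmap of "ports" that must be
 * claimed before use and released afterward. my_request_region and
 * my_release_region are hypothetical stand-ins, not kernel interfaces. */
#define NPORTS 64
static bool claimed[NPORTS];

static bool my_request_region(unsigned first, unsigned n)
{
    /* First check the whole range... */
    for (unsigned i = first; i < first + n; i++)
        if (i >= NPORTS || claimed[i])
            return false;   /* overlap or out of range: fail */
    /* ...then claim it. (The kernel does check-and-claim under a lock.) */
    for (unsigned i = first; i < first + n; i++)
        claimed[i] = true;
    return true;
}

static void my_release_region(unsigned first, unsigned n)
{
    for (unsigned i = first; i < first + n; i++)
        claimed[i] = false;
}
```

A second driver asking for any port inside an already-claimed range fails, which is exactly why check_region's separate check-then-allocate sequence is unsafe: between the two steps, another caller could take the ports.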

操作 I/O 端口

Manipulating I/O ports

当驱动程序请求其活动中需要使用的 I/O 端口范围后,它必须读取和/或写入这些端口。为此,大多数硬件都会区分 8 位、16 位和 32 位端口。通常,您不能像通常处理系统内存访问那样混合使用它们。[ 2 ]

After a driver has requested the range of I/O ports it needs to use in its activities, it must read and/or write to those ports. To this end, most hardware differentiates between 8-bit, 16-bit, and 32-bit ports. Usually you can't mix them like you normally do with system memory access.[2]

因此,C 程序必须调用不同的函数来访问不同大小的端口。正如上一节中所建议的,仅支持内存映射 I/O 寄存器的计算机体系结构通过将端口地址重新映射到内存地址来伪造端口 I/O,内核向驱动程序隐藏这些细节以简化可移植性。Linux 内核头文件(具体来说,与体系结构相关的头文件 <asm/io.h>)定义了以下内联函数来访问 I/O 端口:

A C program, therefore, must call different functions to access different size ports. As suggested in the previous section, computer architectures that support only memory-mapped I/O registers fake port I/O by remapping port addresses to memory addresses, and the kernel hides the details from the driver in order to ease portability. The Linux kernel headers (specifically, the architecture-dependent header <asm/io.h>) define the following inline functions to access I/O ports:

unsigned inb(unsigned port);

void outb(unsigned char byte, unsigned port);
unsigned inb(unsigned port);

void outb(unsigned char byte, unsigned port);

读取或写入字节端口(八位宽)。port 参数在某些平台上定义为 unsigned long,在其他平台上定义为 unsigned short。inb 的返回类型在不同的架构中也不同。

Read or write byte ports (eight bits wide). The port argument is defined as unsigned long for some platforms and unsigned short for others. The return type of inb is also different across architectures.

unsigned inw(unsigned port);

void outw(unsigned short word, unsigned port);
unsigned inw(unsigned port);

void outw(unsigned short word, unsigned port);

这些函数访问 16 位端口(一个字宽);当针对仅支持字节 I/O 的 S390 平台进行编译时,它们不可用。

These functions access 16-bit ports (one word wide); they are not available when compiling for the S390 platform, which supports only byte I/O.

unsigned inl(unsigned port);

void outl(unsigned longword, unsigned port);
unsigned inl(unsigned port);

void outl(unsigned longword, unsigned port);

这些函数访问 32 位端口。根据平台,longword 被声明为 unsigned long 或 unsigned int。与字 I/O 一样,"长"I/O 在 S390 上不可用。

These functions access 32-bit ports. longword is declared as either unsigned long or unsigned int, according to the platform. Like word I/O, "long" I/O is not available on S390.

提示

Tip

从现在开始,当我们使用 unsigned 而不进一步指明类型时,我们指的是一个依赖于体系结构的定义,其确切性质并不重要。这些函数几乎总是可移植的,因为编译器在赋值期间会自动转换值;它们是无符号的这一点有助于防止编译时警告。只要程序员分配合理的值以避免溢出,这种类型转换就不会丢失任何信息。我们在本章中始终坚持这种"不完整类型"的惯例。

From now on, when we use unsigned without further type specifications, we are referring to an architecture-dependent definition whose exact nature is not relevant. The functions are almost always portable, because the compiler automatically casts the values during assignment—their being unsigned helps prevent compile-time warnings. No information is lost with such casts as long as the programmer assigns sensible values to avoid overflow. We stick to this convention of "incomplete typing" throughout this chapter.

请注意,未定义 64 位端口 I/O 操作。即使在 64 位架构上,端口地址空间也使用 32 位(最大)数据路径。

Note that no 64-bit port I/O operations are defined. Even on 64-bit architectures, the port address space uses a 32-bit (maximum) data path.

从用户空间访问 I/O 端口

I/O Port Access from User Space

刚刚描述的函数主要供设备驱动程序使用,但它们也可以从用户空间使用,至少在 PC 级计算机上如此。GNU C 库在 <sys/io.h> 中定义它们。要在用户空间代码中使用 inb 及其相关函数,应满足以下条件:

The functions just described are primarily meant to be used by device drivers, but they can also be used from user space, at least on PC-class computers. The GNU C library defines them in <sys/io.h>. The following conditions should apply in order for inb and friends to be used in user-space code:

  • 必须使用-O选项编译程序以强制扩展内联函数。

  • The program must be compiled with the -O option to force expansion of inline functions.

  • 必须使用 ioperm 或 iopl 系统调用来获取在端口上执行 I/O操作的权限ioperm 获得单个端口的权限,而iopl获得整个 I/O 空间的权限。这两个函数都是 x86 特定的。

  • The ioperm or iopl system calls must be used to get permission to perform I/O operations on ports. ioperm gets permission for individual ports, while iopl gets permission for the entire I/O space. Both of these functions are x86-specific.

  • 该程序必须以 root 身份运行才能调用 ioperm 或 iopl。[ 3 ]或者,它的某个祖先进程必须已在以 root 身份运行时获得了端口访问权限。

  • The program must run as root to invoke ioperm or iopl.[3] Alternatively, one of its ancestors must have gained port access running as root.

如果主机平台既没有 ioperm 也没有 iopl 系统调用,用户空间仍然可以使用 /dev/port 设备文件访问 I/O 端口。但请注意,该文件的含义是高度平台相关的,除了 PC 之外,不太可能对任何其他平台有用。

If the host platform has no ioperm and no iopl system calls, user space can still access I/O ports by using the /dev/port device file. Note, however, that the meaning of the file is very platform-specific and not likely useful for anything but the PC.

示例源代码 misc-progs/inp.c 和 misc-progs/outp.c 是用于在用户空间从命令行读写端口的最小工具。它们期望以多个名称安装(例如 inb、inw 和 inl),并根据用户调用时使用的名称来操作字节、字或长端口。它们在 x86 下使用 ioperm 或 iopl,在其他平台上使用 /dev/port。

The sample sources misc-progs/inp.c and misc-progs/outp.c are a minimal tool for reading and writing ports from the command line, in user space. They expect to be installed under multiple names (e.g., inb, inw, and inl and manipulates byte, word, or long ports depending on which name was invoked by the user). They use ioperm or iopl under x86, /dev/port on other platforms.

如果您想危险地生活并在不获取显式特权的情况下使用您的硬件,则可以将程序设置为 setuid root。但是,请不要在生产系统上安装它们 setuid;它们是设计上的安全漏洞。

The programs can be made setuid root, if you want to live dangerously and play with your hardware without acquiring explicit privileges. Please do not install them setuid on a production system, however; they are a security hole by design.

字符串操作

String Operations

除了单次的 in 和 out 操作之外,一些处理器还实现了特殊指令,用于与相同大小的单个 I/O 端口之间传输字节、字或长整型序列。这些就是所谓的字符串指令,它们执行任务的速度比 C 语言循环更快。以下宏通过使用单条机器指令,或者在目标处理器没有执行字符串 I/O 的指令时通过执行紧凑循环,来实现字符串 I/O 的概念。针对 S390 平台编译时,这些宏根本没有定义。这不应该成为可移植性问题,因为该平台的外围总线与众不同,通常不与其他平台共享设备驱动程序。

In addition to the single-shot in and out operations, some processors implement special instructions to transfer a sequence of bytes, words, or longs to and from a single I/O port of the same size. These are the so-called string instructions, and they perform the task more quickly than a C-language loop can do. The following macros implement the concept of string I/O either by using a single machine instruction or by executing a tight loop if the target processor has no instruction that performs string I/O. The macros are not defined at all when compiling for the S390 platform. This should not be a portability problem, since this platform doesn't usually share device drivers with other platforms, because its peripheral buses are different.

字符串函数的原型是:

The prototypes for string functions are:

void insb(unsigned port, void *addr, unsigned long count);

void outsb(unsigned port, void *addr, unsigned long count);

从内存地址 addr 开始读取或写入 count 个字节。数据从单个端口 port 读取,或写入到单个端口 port。

Read or write count bytes starting at the memory address addr. Data is read from or written to the single port port.

void insw(unsigned port, void *addr, unsigned long count);

void outsw(unsigned port, void *addr, unsigned long count);

将 16 位值读取或写入单个 16 位端口。

Read or write 16-bit values to a single 16-bit port.

void insl(unsigned port, void *addr, unsigned long count);

void outsl(unsigned port, void *addr, unsigned long count);

将 32 位值读取或写入单个 32 位端口。

Read or write 32-bit values to a single 32-bit port.

使用字符串函数时要记住一件事:它们将原始字节流直接移入或移出端口。当端口和主机系统的字节序规则不同时,结果可能会令人惊讶。使用 inw 读取端口时,会在需要时交换字节,使读取的值与主机字节序匹配。相反,字符串函数不执行这种交换。

There is one thing to keep in mind when using the string functions: they move a straight byte stream to or from the port. When the port and the host system have different byte ordering rules, the results can be surprising. Reading a port with inw swaps the bytes, if need be, to make the value read match the host ordering. The string functions, instead, do not perform this swapping.

暂停 I/O

Pausing I/O

某些平台(尤其是 i386)在处理器尝试过快地与总线传输数据时可能会出现问题。当处理器相对于外设总线(此处考虑 ISA)超频时会出现问题,当设备板速度太慢时问题也会显现。解决方案是在每条 I/O 指令之后(如果紧跟着另一条这样的指令)插入一个小的延迟。在 x86 上,这种暂停是通过对端口 0x80(通常但并不总是未使用)执行一条 outb 指令,或者通过忙等待来实现的。有关详细信息,请参阅您的平台的 asm 子目录下的 io.h 文件。

Some platforms—most notably the i386—can have problems when the processor tries to transfer data too quickly to or from the bus. The problems can arise when the processor is overclocked with respect to the peripheral bus (think ISA here) and can show up when the device board is too slow. The solution is to insert a small delay after each I/O instruction if another such instruction follows. On the x86, the pause is achieved by performing an outb instruction to port 0x80 (normally but not always unused), or by busy waiting. See the io.h file under your platform's asm subdirectory for details.

如果您的设备丢失了一些数据,或者您担心它可能会丢失一些数据,您可以使用暂停功能来代替正常功能。暂停函数与前面列出的函数完全相同,但它们的名称以_p结尾;它们被称为 inb_poutb_p等等。这些函数是为大多数支持的体系结构定义的,尽管它们通常扩展为与非暂停 I/O 相同的代码,因为如果体系结构使用相当现代的外设总线运行,则不需要额外的暂停。

If your device misses some data, or if you fear it might miss some, you can use pausing functions in place of the normal ones. The pausing functions are exactly like those listed previously, but their names end in _p; they are called inb_p, outb_p, and so on. The functions are defined for most supported architectures, although they often expand to the same code as nonpausing I/O, because there is no need for the extra pause if the architecture runs with a reasonably modern peripheral bus.

平台依赖性

Platform Dependencies

I/O 指令本质上是高度依赖处理器的。因为它们涉及处理器如何处理数据移入和移出的细节,所以很难隐藏系统之间的差异。因此,与端口 I/O 相关的大部分源代码都是平台相关的。

I/O instructions are, by their nature, highly processor dependent. Because they work with the details of how the processor handles moving data in and out, it is very hard to hide the differences between systems. As a consequence, much of the source code related to port I/O is platform-dependent.

通过回顾函数列表可以看到一种不兼容性,即数据类型:参数的类型随平台间的架构差异而不同。例如,在 x86 上(处理器支持 64 KB 的 I/O 空间)端口类型是 unsigned short,但在其他平台上是 unsigned long,这些平台的端口只是与内存处于同一地址空间中的特殊位置。

You can see one of the incompatibilities, data typing, by looking back at the list of functions, where the arguments are typed differently based on the architectural differences between platforms. For example, a port is unsigned short on the x86 (where the processor supports a 64-KB I/O space), but unsigned long on other platforms, whose ports are just special locations in the same address space as memory.

其他平台依赖性源于处理器的基本结构差异,因此是不可避免的。我们不会详细讨论这些差异,因为我们假设您不会在不了解底层硬件的情况下为特定系统编写设备驱动程序。相反,这里概述了内核支持的架构的功能:

Other platform dependencies arise from basic structural differences in the processors and are, therefore, unavoidable. We won't go into detail about the differences, because we assume that you won't be writing a device driver for a particular system without understanding the underlying hardware. Instead, here is an overview of the capabilities of the architectures supported by the kernel:

IA-32 (x86)

x86_64
IA-32 (x86)

x86_64

该架构支持本章描述的所有功能。端口号的类型为unsigned short

The architecture supports all the functions described in this chapter. Port numbers are of type unsigned short.

IA-64(安腾)
IA-64 (Itanium)

所有功能均支持;端口是unsigned long(并且是内存映射的)。字符串函数是用 C 实现的。

All functions are supported; ports are unsigned long (and memory-mapped). String functions are implemented in C.

Alpha
Alpha

支持所有功能,并且端口是内存映射的。根据所使用的芯片组,端口 I/O 的实现在不同的 Alpha 平台中是不同的。字符串函数用 C 实现并在arch/alpha/lib/io.c中定义。端口是unsigned long.

All the functions are supported, and ports are memory-mapped. The implementation of port I/O is different in different Alpha platforms, according to the chipset they use. String functions are implemented in C and defined in arch/alpha/lib/io.c. Ports are unsigned long.

ARM
ARM

端口内存映射,支持所有功能;字符串函数是用 C 实现的。端口的类型为 unsigned int

Ports are memory-mapped, and all functions are supported; string functions are implemented in C. Ports are of type unsigned int.

Cris
Cris

即使在仿真模式下,该架构也不支持 I/O 端口抽象;各种端口操作被定义为不执行任何操作。

This architecture does not support the I/O port abstraction even in an emulated mode; the various port operations are defined to do nothing at all.

M68k

M68k-nommu
M68k

M68k-nommu

端口是内存映射的。支持字符串函数,端口类型为unsigned char *.

Ports are memory-mapped. String functions are supported, and the port type is unsigned char *.

MIPS

MIPS64
MIPS

MIPS64

MIPS 端口支持所有功能。字符串操作是通过紧密的汇编循环实现的,因为处理器缺乏机器级的字符串 I/O。端口是内存映射的,类型为 unsigned long。

The MIPS port supports all the functions. String operations are implemented with tight assembly loops, because the processor lacks machine-level string I/O. Ports are memory-mapped; they are unsigned long.

PA-RISC
PA-RISC

支持所有功能;在基于 PCI 的系统上端口类型为 int,在 EISA 系统上为 unsigned short;但字符串操作例外,它们使用 unsigned long 类型的端口号。

All of the functions are supported; ports are int on PCI-based systems and unsigned short on EISA systems, except for string operations, which use unsigned long port numbers.

PowerPC

PowerPC64
PowerPC

PowerPC64

支持所有功能;在 32 位系统上端口类型为 unsigned char *,在 64 位系统上为 unsigned long。

All the functions are supported; ports have type unsigned char * on 32-bit systems and unsigned long on 64-bit systems.

S390
S390

与 M68k 类似,该平台的标头仅支持字节宽端口 I/O,不支持字符串操作。端口是char指针并且是内存映射的。

Similar to the M68k, the header for this platform supports only byte-wide port I/O with no string operations. Ports are char pointers and are memory-mapped.

Super-H
Super-H

端口是 unsigned int(内存映射的),并且支持所有功能。

Ports are unsigned int (memory-mapped), and all the functions are supported.

SPARC

SPARC64
SPARC

SPARC64

同样,I/O 空间是内存映射的。端口函数的版本被定义为使用 unsigned long 类型的端口。

Once again, I/O space is memory-mapped. Versions of the port functions are defined to work with unsigned long ports.

好奇的读者可以从io.h文件中提取更多信息,除了我们在本章中描述的函数之外,这些文件有时还定义了一些特定于体系结构的函数。但请注意,其中一些文件相当难以阅读。

The curious reader can extract more information from the io.h files, which sometimes define a few architecture-specific functions in addition to those we describe in this chapter. Be warned that some of these files are rather difficult reading, however.

有趣的是,x86 系列之外的处理器没有一个具有不同的端口地址空间,尽管几个受支持的系列附带 ISA 和/或 PCI 插槽(并且两种总线都实现独立的 I/O 和内存地址空间)。

It's interesting to note that no processor outside the x86 family features a different address space for ports, even though several of the supported families are shipped with ISA and/or PCI slots (and both buses implement separate I/O and memory address spaces).

此外,一些处理器(尤其是早期的 Alpha)缺乏一次移动一个或两个字节的指令。[4] 因此,它们的外设芯片组通过将 8 位和 16 位 I/O 访问映射到内存地址空间中的特殊地址范围来模拟它们。因此,作用于同一端口的 inb 和 inw 指令是通过对不同地址进行操作的两次 32 位内存读取来实现的。幸运的是,所有这些都被本节所述宏的内部实现对设备驱动程序编写者隐藏起来,但我们认为这是一个值得注意的有趣特性。如果您想进一步探究,请在 include/asm-alpha/core_lca.h 中查找示例。

Moreover, some processors (most notably the early Alphas) lack instructions that move one or two bytes at a time.[4] Therefore, their peripheral chipsets simulate 8-bit and 16-bit I/O accesses by mapping them to special address ranges in the memory address space. Thus, an inb and an inw instruction that act on the same port are implemented by two 32-bit memory reads that operate on different addresses. Fortunately, all of this is hidden from the device driver writer by the internals of the macros described in this section, but we feel it's an interesting feature to note. If you want to probe further, look for examples in include/asm-alpha/core_lca.h.

每个平台的 I/O 操作如何执行在每个平台的程序员手册中都有详细描述;这些手册通常可以在网上以 PDF 形式下载。

How I/O operations are performed on each platform is well described in the programmer's manual for each platform; those manuals are usually available for download as PDFs on the Web.

I/O 端口示例

An I/O Port Example

我们用来显示设备驱动程序中的端口 I/O 的示例代码作用于通用数字 I/O 端口;大多数计算机系统中都可以找到此类端口。

The sample code we use to show port I/O from within a device driver acts on general-purpose digital I/O ports; such ports are found in most computer systems.

数字 I/O 端口最常见的形式是字节宽的 I/O 位置,可以是内存映射的,也可以是端口映射的。当您将值写入输出位置时,输出引脚上看到的电信号会根据写入的各个位而变化。当您从输入位置读取值时,输入引脚上看到的当前逻辑电平将作为单独的位值返回。

A digital I/O port, in its most common incarnation, is a byte-wide I/O location, either memory-mapped or port-mapped. When you write a value to an output location, the electrical signal seen on output pins is changed according to the individual bits being written. When you read a value from the input location, the current logic level seen on input pins is returned as individual bit values.

此类 I/O 端口的实际实现和软件接口因系统而异。大多数时候,I/O 引脚由两个 I/O 位置控制:一个位置允许选择哪些引脚用作输入和哪些引脚用作输出,另一个位置可以实际读取或写入逻辑电平。然而,有时事情甚至更简单,这些位被硬连线为输入或输出(但在这种情况下,它们不再称为“通用 I/O”);所有个人计算机上的并行端口就是这样一种不那么通用的 I/O 端口。无论哪种方式,I/O 引脚都可以通过我们稍后介绍的示例代码使用。

The actual implementation and software interface of such I/O ports varies from system to system. Most of the time, I/O pins are controlled by two I/O locations: one that allows selecting what pins are used as input and what pins are used as output and one in which you can actually read or write logic levels. Sometimes, however, things are even simpler, and the bits are hardwired as either input or output (but, in this case, they're no longer called "general-purpose I/O"); the parallel port found on all personal computers is one such not-so-general-purpose I/O port. Either way, the I/O pins are usable by the sample code we introduce shortly.

并行端口概述

An Overview of the Parallel Port

因为我们预计大多数读者会使用称为“个人计算机”形式的 x86 平台,所以我们认为有必要解释一下 PC 并行端口的设计方式。并行端口是在个人计算机上运行数字 I/O 示例代码的首选外围接口。尽管大多数读者可能都有可用的并行端口规范,但为了您的方便,我们在这里总结了它们。

Because we expect most readers to be using an x86 platform in the form called "personal computer," we feel it is worth explaining how the PC parallel port is designed. The parallel port is the peripheral interface of choice for running digital I/O sample code on a personal computer. Although most readers probably have parallel port specifications available, we summarize them here for your convenience.

并行接口的最小配置(我们忽略 ECP 和 EPP 模式)由三个 8 位端口组成。PC 标准规定第一个并行接口的 I/O 端口从 0x378 开始,第二个从 0x278 开始。第一个端口是双向数据寄存器;它直接连接到物理连接器上的引脚 2-9。第二个端口是只读状态寄存器;当并行端口用于打印机时,该寄存器报告打印机状态的多个方面,例如在线、缺纸或忙碌。第三个端口是仅输出的控制寄存器,除其他外,它控制是否启用中断。

The parallel interface, in its minimal configuration (we overlook the ECP and EPP modes) is made up of three 8-bit ports. The PC standard starts the I/O ports for the first parallel interface at 0x378 and for the second at 0x278. The first port is a bidirectional data register; it connects directly to pins 2-9 on the physical connector. The second port is a read-only status register; when the parallel port is being used for a printer, this register reports several aspects of printer status, such as being online, out of paper, or busy. The third port is an output-only control register, which, among other things, controls whether interrupts are enabled.

并行通信中使用的信号电平是标准晶体管-晶体管逻辑 (TTL) 电平:0 伏和 5 伏,逻辑阈值约为 1.2 伏。尽管大多数现代并行端口在电流和电压额定值方面都做得更好,但您可以信赖这些端口至少满足标准 TTL LS 电流额定值。

The signal levels used in parallel communications are standard transistor-transistor logic (TTL) levels: 0 and 5 volts, with the logic threshold at about 1.2 volts. You can count on the ports at least meeting the standard TTL LS current ratings, although most modern parallel ports do better in both current and voltage ratings.

警告

Warning

并行连接器未与计算机的内部电路隔离,如果您想将逻辑门直接连接到端口,这非常有用。但一定要注意正确接线;当您使用自己的定制电路时,并行端口电路很容易损坏,除非您在电路中添加光隔离器。如果您担心会损坏主板,则可以选择使用插入式并行端口。

The parallel connector is not isolated from the computer's internal circuitry, which is useful if you want to connect logic gates directly to the port. But you have to be careful to do the wiring correctly; the parallel port circuitry is easily damaged when you play with your own custom circuitry, unless you add optoisolators to your circuit. You can choose to use plug-in parallel ports if you fear you'll damage your motherboard.

位规格如图 9-1所示。您可以访问 12 个输出位和 5 个输入位,其中一些位在其信号路径过程中逻辑反转。唯一没有关联信号引脚的位是端口 2 的位 4 (0x10),它允许来自并行端口的中断。我们在第 10 章中使用该位作为中断处理程序实现的一部分。

The bit specifications are outlined in Figure 9-1. You can access 12 output bits and 5 input bits, some of which are logically inverted over the course of their signal path. The only bit with no associated signal pin is bit 4 (0x10) of port 2, which enables interrupts from the parallel port. We use this bit as part of our implementation of an interrupt handler in Chapter 10.

并行端口的引脚排列

图 9-1。并行端口的引脚排列

Figure 9-1. The pinout of the parallel port

示例驱动程序

A Sample Driver

我们引入的驱动程序称为 short(Simple Hardware Operations and Raw Tests,简单的硬件操作和原始测试)。它所做的只是从您在加载时选择的端口开始,读写几个 8 位端口。默认情况下,它使用分配给 PC 并行接口的端口范围。每个设备节点(具有唯一的次设备号)访问不同的端口。short 驱动程序没有做任何有用的事情;它只是将作用于端口的单条指令隔离出来供外部使用。如果不习惯端口 I/O,可以使用 short 来熟悉它;您可以测量通过端口传输数据所需的时间,或玩其他游戏。

The driver we introduce is called short (Simple Hardware Operations and Raw Tests). All it does is read and write a few 8-bit ports, starting from the one you select at load time. By default, it uses the port range assigned to the parallel interface of the PC. Each device node (with a unique minor number) accesses a different port. The short driver doesn't do anything useful; it just isolates for external use a single instruction acting on a port. If you are not used to port I/O, you can use short to get familiar with it; you can measure the time it takes to transfer data through a port or play other games.

为了让Short在您的系统上工作,它必须能够自由访问底层硬件设备(默认情况下是并行接口);因此,没有其他驱动程序可以分配它。大多数现代发行版将并行端口驱动程序设置为仅在需要时加载的模块,因此 I/O 地址的争用通常不是问题。但是,如果您从 Short中收到“无法获取 I/O 地址”错误(在控制台或系统日志文件中),则其他某些驱动程序可能已经占用了该端口。快速查看/proc/ioports通常会告诉您哪个驱动程序出现问题。如果您不使用并行接口,同样的注意事项也适用于其他 I/O 设备。

For short to work on your system, it must have free access to the underlying hardware device (by default, the parallel interface); thus, no other driver may have allocated it. Most modern distributions set up the parallel port drivers as modules that are loaded only when needed, so contention for the I/O addresses is not usually a problem. If, however, you get a "can't get I/O address" error from short (on the console or in the system log file), some other driver has probably already taken the port. A quick look at /proc/ioports usually tells you which driver is getting in the way. The same caveat applies to other I/O devices if you are not using the parallel interface.

从现在起,我们仅提及“并行接口”以简化讨论。但是,您可以在加载时设置 base 模块参数,将 short 重定向到其他 I/O 设备。此功能允许示例代码在任何可以访问经由 outb 和 inb 访问的数字 I/O 接口的 Linux 平台上运行(即使实际硬件在除 x86 之外的所有平台上都是内存映射的)。稍后,在第 9.4 节中,我们将展示如何将 short 与通用内存映射数字 I/O 一起使用。

From now on, we just refer to "the parallel interface" to simplify the discussion. However, you can set the base module parameter at load time to redirect short to other I/O devices. This feature allows the sample code to run on any Linux platform where you have access to a digital I/O interface that is accessible via outb and inb (even though the actual hardware is memory-mapped on all platforms but the x86). Later, in Section 9.4 we show how short can be used with generic memory-mapped digital I/O as well.

要观察并行连接器上发生的情况,并且如果您有点使用硬件的倾向,您可以将一些 LED 焊接到输出引脚上。每个 LED 应串联连接到一个通向接地引脚的 1-K 电阻器(当然,除非您的 LED 内置了该电阻器)。如果将输出引脚连接到输入引脚,您将生成自己的输入以从输入端口读取。

To watch what happens on the parallel connector and if you have a bit of an inclination to work with hardware, you can solder a few LEDs to the output pins. Each LED should be connected in series to a 1-K resistor leading to a ground pin (unless, of course, your LEDs have the resistor built in). If you connect an output pin to an input pin, you'll generate your own input to be read from the input ports.

请注意,您不能仅将打印机连接到并行端口并查看发送到 Short 的数据。该驱动程序实现了对 I/O 端口的简单访问,并且不执行打印机操作数据所需的握手。在下一章中,我们将展示一个示例驱动程序(称为Shortprint),它能够驱动并行打印机;然而,该驱动程序使用中断,因此我们还不能完全了解它。

Note that you cannot just connect a printer to the parallel port and see data sent to short. This driver implements simple access to the I/O ports and does not perform the handshake that printers need to operate on the data. In the next chapter, we show a sample driver (called shortprint), that is capable of driving parallel printers; that driver uses interrupts, however, so we can't get to it quite yet.

如果您打算通过将 LED 焊接到 D 型连接器来查看并行数据,我们建议您不要使用引脚 9 和 10,因为我们稍后将它们连接在一起以运行第 10 章中所示的示例代码。

If you are going to view parallel data by soldering LEDs to a D-type connector, we suggest that you not use pins 9 and 10, because we connect them together later to run the sample code shown in Chapter 10.

short而言,/dev/short0向位于I/O 地址base(0x378,除非在加载时更改)的8 位端口进行写入和读取。 /dev/short1写入位于 的 8 位端口 base + 1,依此类推直至base + 7

As far as short is concerned, /dev/short0 writes to and reads from the 8-bit port located at the I/O address base (0x378 unless changed at load time). /dev/short1 writes to the 8-bit port located at base + 1, and so on up to base + 7.

/dev/short0执行的实际输出操作基于使用outb 的紧密循环。内存屏障指令用于确保输出操作实际发生并且不会被优化掉:

The actual output operation performed by /dev/short0 is based on a tight loop using outb. A memory barrier instruction is used to ensure that the output operation actually takes place and is not optimized away:

while (count--) {
    outb(*(ptr++), port);
    wmb(  );
}

您可以运行以下命令来点亮 LED:

You can run the following command to light your LEDs:

echo  -n "any string"  > /dev/short0

每个 LED 监控输出端口的单个位。请记住,只有最后写入的字符在输出引脚上保持稳定的时间足以被您的眼睛感知。因此,我们建议您通过将-n选项传递给echo来防止自动插入尾随换行符。

Each LED monitors a single bit of the output port. Remember that only the last character written remains steady on the output pins long enough to be perceived by your eyes. For that reason, we suggest that you prevent automatic insertion of a trailing newline by passing the -n option to echo.

读取是由类似的函数执行的,该函数围绕 inb而不是outb构建。为了从并行端口读取“有意义”的值,您需要将一些硬件连接到连接器的输入引脚以生成信号。如果没有信号,您将读取无穷无尽的相同字节流。如果您选择从输出端口读取,您很可能会取回写入该端口的最后一个值(这适用于并行接口和大多数其他常用的数字 I/O 电路)。因此,那些不愿意拿出烙铁的人可以通过运行以下命令来读取端口 0x378 上的当前输出值:

Reading is performed by a similar function, built around inb instead of outb. In order to read "meaningful" values from the parallel port, you need to have some hardware connected to the input pins of the connector to generate signals. If there is no signal, you read an endless stream of identical bytes. If you choose to read from an output port, you most likely get back the last value written to the port (this applies to the parallel interface and to most other digital I/O circuits in common use). Thus, those uninclined to get out their soldering irons can read the current output value on port 0x378 by running a command such as:

dd if=/dev/short0 bs=1 count=1 | od -t x1

为了演示所有 I/O 指令的使用,每个 short 设备都有三种变体:/dev/short0 执行刚才所示的循环,/dev/short0p 使用 outb_p 和 inb_p 代替“快速”函数,/dev/short0s 使用字符串指令。这样的设备有八个,从 short0 到 short7。虽然 PC 并行接口只有三个端口,但如果使用不同的 I/O 设备来运行测试,您可能需要更多端口。

To demonstrate the use of all the I/O instructions, there are three variations of each short device: /dev/short0 performs the loop just shown, /dev/short0p uses outb_p and inb_p in place of the "fast" functions, and /dev/short0s uses the string instructions. There are eight such devices, from short0 to short7. Although the PC parallel interface has only three ports, you may need more of them if using a different I/O device to run your tests.

short 驱动程序执行的硬件控制是绝对最少的,但足以展示如何使用 I/O 端口指令。有兴趣的读者可以查看 parport 和 parport_pc 模块的源代码,了解为了支持并行端口上的一系列设备(打印机、磁带备份、网络接口),这种设备在现实中会变得多么复杂。

The short driver performs an absolute minimum of hardware control but is adequate to show how the I/O port instructions are used. Interested readers may want to look at the source for the parport and parport_pc modules to see how complicated this device can get in real life in order to support a range of devices (printers, tape backup, network interfaces) on the parallel port.

使用 I/O 内存

Using I/O Memory

尽管 I/O 端口在 x86 领域很流行,但与设备通信的主要机制是通过内存映射寄存器和设备内存。两者都称为I/O 存储器,因为寄存器和存储器之间的差异对于软件来说是透明的。

Despite the popularity of I/O ports in the x86 world, the main mechanism used to communicate with devices is through memory-mapped registers and device memory. Both are called I/O memory because the difference between registers and memory is transparent to software.

I/O 内存只是一个类似于 RAM 的位置区域,设备通过总线向处理器提供该区域。该存储器可用于多种用途,例如保存视频数据或以太​​网数据包,以及实现与 I/O 端口类似的设备寄存器(即,它们具有与读写相关的副作用)。

I/O memory is simply a region of RAM-like locations that the device makes available to the processor over the bus. This memory can be used for a number of purposes, such as holding video data or Ethernet packets, as well as implementing device registers that behave just like I/O ports (i.e., they have side effects associated with reading and writing them).

访问 I/O 内存的方式取决于计算机体系结构、总线和所使用的设备,尽管原理在任何地方都是相同的。本章的讨论主要涉及 ISA 和 PCI 内存,同时也尝试传达一般信息。虽然这里介绍了对 PCI 内存的访问,但对 PCI 的全面讨论将推迟到第 12 章

The way to access I/O memory depends on the computer architecture, bus, and device being used, although the principles are the same everywhere. The discussion in this chapter touches mainly on ISA and PCI memory, while trying to convey general information as well. Although access to PCI memory is introduced here, a thorough discussion of PCI is deferred to Chapter 12.

根据所使用的计算机平台和总线,I/O 内存可能会也可能不会通过页表访问。当访问通过页表时,内核必须首先安排物理地址对您的驱动程序可见,这通常意味着您必须调用 ioremap 在进行任何 I/O 操作之前。如果不需要页表,I/O 内存位置看起来非常像 I/O 端口,您可以使用适当的包装函数读取和写入它们。

Depending on the computer platform and bus being used, I/O memory may or may not be accessed through page tables. When access passes through page tables, the kernel must first arrange for the physical address to be visible from your driver, and this usually means that you must call ioremap before doing any I/O. If no page tables are needed, I/O memory locations look pretty much like I/O ports, and you can just read and write to them using proper wrapper functions.

无论是否需要ioremap来访问 I/O 内存,都不鼓励直接使用指向 I/O 内存的指针。尽管(如第 9.1 节中介绍的)I/O 存储器在硬件级别上像普通 RAM 一样寻址,但第 9.1.1 节中概述的额外注意建议避免使用普通指针。用于访问 I/O 内存的包装函数在所有平台上都是安全的,并且只要直接指针取消引用可以执行该操作,就会被优化。

Whether or not ioremap is required to access I/O memory, direct use of pointers to I/O memory is discouraged. Even though (as introduced in Section 9.1) I/O memory is addressed like normal RAM at the hardware level, the extra care outlined in Section 9.1.1 suggests avoiding normal pointers. The wrapper functions used to access I/O memory are safe on all platforms and are optimized away whenever straight pointer dereferencing can perform the operation.

因此,尽管取消引用指针(目前)在 x86 上有效,但未能使用正确的宏会妨碍驱动程序的可移植性和可读性。

Therefore, even though dereferencing a pointer works (for now) on the x86, failure to use the proper macros hinders the portability and readability of the driver.

I/O 内存分配和映射

I/O Memory Allocation and Mapping

I/O 内存区域 必须在使用前分配。内存区域分配的接口(在<linux/ioport.h>中定义)是:

I/O memory regions must be allocated prior to use. The interface for allocation of memory regions (defined in <linux/ioport.h>) is:

struct resource *request_mem_region(unsigned long start, unsigned long len,
                                    char *name);

该函数从 start 开始分配一个 len 字节的内存区域。如果一切顺利,则返回一个非 NULL 指针;否则返回值为 NULL。所有 I/O 内存分配都列在 /proc/iomem 中。

This function allocates a memory region of len bytes, starting at start. If all goes well, a non-NULL pointer is returned; otherwise the return value is NULL. All I/O memory allocations are listed in /proc/iomem.

不再需要时应释放​​内存区域:

Memory regions should be freed when no longer needed:

void release_mem_region(unsigned long start, unsigned long len);

还有一个用于检查 I/O 内存区域可用性的旧函数:

There is also an old function for checking I/O memory region availability:

int check_mem_region(unsigned long start, unsigned long len);

但是,与 check_region 一样,这个函数是不安全的,应该避免使用。

But, as with check_region, this function is unsafe and should be avoided.

在访问内存之前,分配 I/O 内存并不是唯一需要的步骤。您还必须确保内核可以访问这块 I/O 内存。获取 I/O 内存不仅仅是取消引用指针的问题;在许多系统上,I/O 内存根本无法通过这种方式直接访问,因此必须首先建立映射。这就是 ioremap 函数的作用,它在第 8 章 8.4 节介绍过。该函数专门设计用于将虚拟地址分配给 I/O 内存区域。

Allocation of I/O memory is not the only required step before that memory may be accessed. You must also ensure that this I/O memory has been made accessible to the kernel. Getting at I/O memory is not just a matter of dereferencing a pointer; on many systems, I/O memory is not directly accessible in this way at all. So a mapping must be set up first. This is the role of the ioremap function, introduced in Section 8.4 in Chapter 8. The function is designed specifically to assign virtual addresses to I/O memory regions.

一旦配备了ioremap(和iounmap),设备驱动程序就可以访问任何 I/O 内存地址,无论它是否直接映射到虚拟地址空间。但请记住,从ioremap返回的地址 不应直接取消引用;相反,应该使用内核提供的访问器函数。在讨论这些函数之前,我们最好回顾一下ioremap原型并介绍一些我们在上一章中忽略的细节。

Once equipped with ioremap (and iounmap), a device driver can access any I/O memory address, whether or not it is directly mapped to virtual address space. Remember, though, that the addresses returned from ioremap should not be dereferenced directly; instead, accessor functions provided by the kernel should be used. Before we get into those functions, we'd better review the ioremap prototypes and introduce a few details that we passed over in the previous chapter.

根据以下定义调用函数:

The functions are called according to the following definition:

#include <asm/io.h>
void *ioremap(unsigned long phys_addr, unsigned long size);
void *ioremap_nocache(unsigned long phys_addr, unsigned long size);
void iounmap(void * addr);

首先,您会注意到新函数 ioremap_nocache。我们没有在第 8 章中介绍它,因为它的含义无疑与硬件相关。引用内核头文件之一的话:“如果某些控制寄存器位于这样的区域,而写合并或读缓存又不可取,它就很有用。”实际上,该函数的实现在大多数计算机平台上与 ioremap 相同:在所有 I/O 内存都已通过不可缓存地址可见的情况下,没有理由实现 ioremap 的单独非缓存版本。

First of all, you notice the new function ioremap_nocache. We didn't cover it in Chapter 8, because its meaning is definitely hardware related. Quoting from one of the kernel headers: "It's useful if some control registers are in such an area, and write combining or read caching is not desirable." Actually, the function's implementation is identical to ioremap on most computer platforms: in situations where all of I/O memory is already visible through noncacheable addresses, there's no reason to implement a separate, noncaching version of ioremap.

访问 I/O 内存

Accessing I/O Memory

在某些平台上,您使用ioremap的返回值作为指针可能会逃脱惩罚 。这种使用是不可移植的,并且内核开发人员越来越多地致力于消除任何此类使用。获取 I/O 内存的正确方法是通过为此目的提供的一组函数(通过<asm/io.h>定义)。

On some platforms, you may get away with using the return value from ioremap as a pointer. Such use is not portable, and, increasingly, the kernel developers have been working to eliminate any such use. The proper way of getting at I/O memory is via a set of functions (defined via <asm/io.h>) provided for that purpose.

要从 I/O 内存读取,请使用以下方法之一:

To read from I/O memory, use one of the following:

unsigned int ioread8(void *addr);
unsigned int ioread16(void *addr);
unsigned int ioread32(void *addr);

这里,addr 应该是从 ioremap 获得的地址(可能带有整数偏移量);返回值是从给定 I/O 内存中读取的值。

Here, addr should be an address obtained from ioremap (perhaps with an integer offset); the return value is what was read from the given I/O memory.

有一组类似的函数用于写入 I/O 内存:

There is a similar set of functions for writing to I/O memory:

void iowrite8(u8 value, void *addr);
void iowrite16(u16 value, void *addr);
void iowrite32(u32 value, void *addr);

如果必须向给定 I/O 内存地址读取或写入一系列值,则可以使用函数的重复版本:

If you must read or write a series of values to a given I/O memory address, you can use the repeating versions of the functions:

void ioread8_rep(void *addr, void *buf, unsigned long count);
void ioread16_rep(void *addr, void *buf, unsigned long count);
void ioread32_rep(void *addr, void *buf, unsigned long count);
void iowrite8_rep(void *addr, const void *buf, unsigned long count);
void iowrite16_rep(void *addr, const void *buf, unsigned long count);
void iowrite32_rep(void *addr, const void *buf, unsigned long count);

这些函数在给定的 buf 与给定的 addr 之间读取或写入 count 个值。请注意,count 以所传输数据项的大小为单位;ioread32_rep 读取 count 个 32 位值,存放到从 buf 开始的内存中。

These functions read or write count values from the given buf to the given addr. Note that count is expressed in the size of the data being written; ioread32_rep reads count 32-bit values starting at buf.

上述函数都对给定的 addr 执行所有 I/O。相反,如果您需要对一块 I/O 内存进行操作,则可以使用以下方法之一:

The functions described above perform all I/O to the given addr. If, instead, you need to operate on a block of I/O memory, you can use one of the following:

void memset_io(void *addr, u8 value, unsigned int count);
void memcpy_fromio(void *dest, void *source, unsigned int count);
void memcpy_toio(void *dest, void *source, unsigned int count);

这些函数的行为与对应的 C 库函数类似。

These functions behave like their C library analogs.

如果您通读内核源代码,您会看到在使用 I/O 内存时对一组旧函数的许多调用。这些函数仍然有效,但不鼓励在新代码中使用它们。除此之外,它们的安全性较低,因为它们不执行相同类型的检查。尽管如此,我们在这里描述它们:

If you read through the kernel source, you see many calls to an older set of functions when I/O memory is being used. These functions still work, but their use in new code is discouraged. Among other things, they are less safe because they do not perform the same sort of type checking. Nonetheless, we describe them here:

unsigned readb(address);

unsigned readw(address);

unsigned readl(address);

这些宏用于从 I/O 存储器检索 8 位、16 位和 32 位数据值。

These macros are used to retrieve 8-bit, 16-bit, and 32-bit data values from I/O memory.

void writeb(unsigned value, address);

void writew(unsigned value, address);

void writel(unsigned value, address);

与前面的函数一样,这些函数(宏)用于写入 8 位、16 位和 32 位数据项。

Like the previous functions, these functions (macros) are used to write 8-bit, 16-bit, and 32-bit data items.

一些 64 位平台还提供readqwriteq,用于 PCI 总线上的四字(8 字节)内存操作。四字命名法是所有实际处理器都具有 16 位字时代的历史遗留物。实际上,用于 32 位值的L命名也变得不正确,但重命名所有内容会使事情更加混乱。

Some 64-bit platforms also offer readq and writeq, for quad-word (8-byte) memory operations on the PCI bus. The quad-word nomenclature is a historical leftover from the times when all real processors had 16-bit words. Actually, the L naming used for 32-bit values has become incorrect too, but renaming everything would confuse things even more.

端口作为 I/O 存储器

Ports as I/O Memory

一些硬件有一个有趣的功能:一些版本使用 I/O 端口,而其他版本则使用 I/O 内存。两种情况下导出到处理器的寄存器都是相同的,但访问方法不同。作为一种让处理此类硬件的驱动程序变得更轻松的方法,并且作为一种最小化 I/O 端口和内存访问之间的明显差异的方法,2.6 内核提供了一个名为ioport_map的函数 :

Some hardware has an interesting feature: some versions use I/O ports, while others use I/O memory. The registers exported to the processor are the same in either case, but the access method is different. As a way of making life easier for drivers dealing with this kind of hardware, and as a way of minimizing the apparent differences between I/O port and memory accesses, the 2.6 kernel provides a function called ioport_map:

void *ioport_map(unsigned long port, unsigned int count);

该函数重新映射 count 个 I/O 端口,使它们看起来像 I/O 内存。此后,驱动程序可以在返回的地址上使用 ioread8 及其同类函数,而完全忘记自己正在使用 I/O 端口。

This function remaps count I/O ports and makes them appear to be I/O memory. From that point thereafter, the driver may use ioread8 and friends on the returned addresses and forget that it is using I/O ports at all.

当不再需要此映射时,应撤消它:

This mapping should be undone when it is no longer needed:

void ioport_unmap(void *addr);

这些函数使 I/O 端口看起来像内存。但请注意,I/O 端口仍必须先使用 request_region 分配,然后才能以这种方式重新映射。

These functions make I/O ports look like memory. Do note, however, that the I/O ports must still be allocated with request_region before they can be remapped in this way.
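作为示意,下面是一个假设性的初始化片段,展示了 request_region、ioport_map 与 ioread8 的组合用法;其中的端口基址 MYDEV_BASE、端口数量和设备名均为为演示而虚构:

As an illustration, here is a hypothetical setup fragment combining request_region, ioport_map, and ioread8; the port base MYDEV_BASE, port count, and device name are invented for this sketch:

```c
#include <linux/ioport.h>
#include <asm/io.h>

#define MYDEV_BASE   0x220   /* hypothetical port base, for illustration only */
#define MYDEV_NPORTS 8

static void *mydev_io;

static int mydev_setup(void)
{
    /* allocate the port range first; ioport_map does not do this for us */
    if (!request_region(MYDEV_BASE, MYDEV_NPORTS, "mydev"))
        return -EBUSY;

    mydev_io = ioport_map(MYDEV_BASE, MYDEV_NPORTS);
    if (!mydev_io) {
        release_region(MYDEV_BASE, MYDEV_NPORTS);
        return -ENOMEM;
    }

    /* from here on, the ports can be treated as I/O memory */
    (void) ioread8(mydev_io);
    return 0;
}

static void mydev_teardown(void)
{
    ioport_unmap(mydev_io);
    release_region(MYDEV_BASE, MYDEV_NPORTS);
}
```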

将 short 重用于 I/O 内存

Reusing short for I/O Memory

前面介绍的用于访问 I/O 端口的 short 示例模块也可用于访问 I/O 内存。为此,您必须在加载时告诉它使用 I/O 内存;另外,您需要更改基地址,使其指向您的 I/O 区域。

The short sample module, introduced earlier to access I/O ports, can be used to access I/O memory as well. To this aim, you must tell it to use I/O memory at load time; also, you need to change the base address to make it point to your I/O region.

例如,这就是我们如何使用 short 点亮 MIPS 开发板上调试 LED 的方法:

For example, this is how we used short to light the debug LEDs on a MIPS development board:

mips.root# ./short_load use_mem=1 base=0xb7ffffc0
mips.root# echo -n 7 > /dev/short0

short 用于 I/O 内存时的用法与用于 I/O 端口时相同。

Use of short for I/O memory is the same as it is for I/O ports.

以下片段显示了 short 在写入内存位置时使用的循环:

The following fragment shows the loop used by short in writing to a memory location:

while (count--) {
    iowrite8(*ptr++, address);
    wmb(  );
}

请注意此处使用了写内存屏障。由于 iowrite8 在许多体系结构上可能会变成直接赋值,因此需要内存屏障来确保写入按预期顺序发生。

Note the use of a write memory barrier here. Because iowrite8 likely turns into a direct assignment on many architectures, the memory barrier is needed to ensure that the writes happen in the expected order.

short 使用 inb 和 outb 来展示这是如何完成的。然而,对于读者来说,使用 ioport_map 更改 short 以重新映射 I/O 端口,并大大简化其余代码,将是一个简单的练习。

short uses inb and outb to show how that is done. It would be a straightforward exercise for the reader, however, to change short to remap I/O ports with ioport_map, and simplify the rest of the code considerably.

ISA 内存低于 1 MB

ISA Memory Below 1 MB

最著名的 I/O 内存区域之一是个人计算机上的 ISA 范围。这是介于 640 KB (0xA0000) 和 1 MB (0x100000) 之间的内存范围。因此,它正好出现在常规系统 RAM 的中间。这个定位可能看起来有点奇怪;这是 20 世纪 80 年代初做出的一个决定的产物,当时 640 KB 内存似乎多到没有人能用完。

One of the most well-known I/O memory regions is the ISA range found on personal computers. This is the memory range between 640 KB (0xA0000) and 1 MB (0x100000). Therefore, it appears right in the middle of regular system RAM. This positioning may seem a little strange; it is an artifact of a decision made in the early 1980s, when 640 KB of memory seemed like more than anybody would ever be able to use.

该内存范围属于非直接映射类内存。[ 5 ]您可以按前面的解释使用 short 模块(即在加载时设置 use_mem)在该内存范围中读/写几个字节。

This memory range belongs to the non-directly-mapped class of memory.[5] You can read/write a few bytes in that memory range using the short module as explained previously, that is, by setting use_mem at load time.

尽管 ISA I/O 内存仅存在于 x86 级计算机中,但我们认为值得花几句话和一个示例驱动程序来介绍它。

Although ISA I/O memory exists only in x86-class computers, we think it's worth spending a few words and a sample driver on it.

我们不会在本章中讨论 PCI 内存,因为它是最干净的 I/O 内存:一旦知道物理地址,就可以简单地重新映射和访问它。PCI I/O 内存的“问题”在于它不适合作为本章的工作示例,因为我们无法提前知道 PCI 内存映射到的物理地址,或者访问是否安全这些范围中的任何一个。我们选择描述 ISA 内存范围,因为它不太干净,而且更适合运行示例代码。

We are not going to discuss PCI memory in this chapter, since it is the cleanest kind of I/O memory: once you know the physical address, you can simply remap and access it. The "problem" with PCI I/O memory is that it doesn't lend itself to a working example for this chapter, because we can't know in advance the physical addresses your PCI memory is mapped to, or whether it's safe to access either of those ranges. We chose to describe the ISA memory range, because it's both less clean and more suitable to running sample code.

为了演示对 ISA 内存的访问,我们使用了另一个愚蠢的小模块(示例源的一部分)。事实上,这个工具被称为“silly”,是“Simple Tool for Unloading and Printing ISA Data”的缩写,或者类似的名称。

To demonstrate access to ISA memory, we use yet another silly little module (part of the sample sources). In fact, this one is called silly, as an acronym for Simple Tool for Unloading and Printing ISA Data, or something like that.

该模块通过提供对整个 384 KB 内存空间的访问并展示所有不同的 I/O 函数来补充 short 的功能。它具有四个设备节点,使用不同的数据传输函数执行相同的任务。silly 设备以类似于 /dev/mem 的方式充当 I/O 内存上的窗口。您可以读取和写入数据,并 lseek 到任意 I/O 内存地址。

The module supplements the functionality of short by giving access to the whole 384-KB memory space and by showing all the different I/O functions. It features four device nodes that perform the same task using different data transfer functions. The silly devices act as a window over I/O memory, in a way similar to /dev/mem. You can read and write data, and lseek to an arbitrary I/O memory address.

因为silly提供了对ISA内存的访问,所以它必须首先将物理ISA地址映射到内核虚拟地址。在 Linux 内核的早期,人们可以简单地分配一个指向感兴趣的 ISA 地址的指针,然后直接取消引用它。然而,在现代世界中,我们必须使用虚拟内存系统并首先重新映射内存范围。此映射是通过 ioremap完成的,如前所述 :

Because silly provides access to ISA memory, it must start by mapping the physical ISA addresses into kernel virtual addresses. In the early days of the Linux kernel, one could simply assign a pointer to an ISA address of interest, then dereference it directly. In the modern world, though, we must work with the virtual memory system and remap the memory range first. This mapping is done with ioremap, as explained earlier for short:

#define ISA_BASE    0xA0000
#define ISA_MAX     0x100000  /* for general memory access */

    /* this line appears in silly_init */
    io_base = ioremap(ISA_BASE, ISA_MAX - ISA_BASE);

ioremap 返回一个指针值,可以与 ioread8 以及第 9.4.2 节中解释的其他函数一起使用。

ioremap returns a pointer value that can be used with ioread8 and the other functions explained in Section 9.4.2.

让我们回顾一下示例模块,看看这些函数如何使用。/dev/sillyb 的次要编号为 0,使用 ioread8 和 iowrite8 访问 I/O 内存。以下代码显示了 read 的实现,它使地址范围 0xA0000-0xFFFFF 可以作为 0-0x5FFFF 范围内的虚拟文件来访问。读取函数被构造为针对不同访问模式的 switch 语句;下面是 sillyb 的 case:

Let's look back at our sample module to see how these functions might be used. /dev/sillyb, featuring minor number 0, accesses I/O memory with ioread8 and iowrite8. The following code shows the implementation for read, which makes the address range 0xA0000-0xFFFFF available as a virtual file in the range 0-0x5FFFF. The read function is structured as a switch statement over the different access modes; here is the sillyb case:

case M_8: 
  while (count) {
      *ptr = ioread8(add);
      add++;
      count--;
      ptr++;
  }
  break;

接下来的两个设备是 /dev/sillyw(次要编号 1)和 /dev/sillyl(次要编号 2)。它们的行为类似于 /dev/sillyb,只不过它们使用 16 位和 32 位函数。这是 sillyl 的写入实现,同样是 switch 语句的一部分:

The next two devices are /dev/sillyw (minor number 1) and /dev/sillyl (minor number 2). They act like /dev/sillyb, except that they use 16-bit and 32-bit functions. Here's the write implementation of sillyl, again part of a switch:

case M_32: 
  while (count >= 4) {
      iowrite32(*(u32 *)ptr, add);
      add += 4;
      count -= 4;
      ptr += 4;
  }
  break;

最后一个设备是/dev/sillycp(次要编号 3),它使用memcpy_*io函数来执行相同的任务。这是其读取实现的核心:

The last device is /dev/sillycp (minor number 3), which uses the memcpy_*io functions to perform the same task. Here's the core of its read implementation:

case M_memcpy:
  memcpy_fromio(ptr, add, count);
  break;

因为ioremap用于提供对 ISA 内存区域的访问,所以当模块卸载时,silly必须调用iounmap :

Because ioremap was used to provide access to the ISA memory area, silly must invoke iounmap when the module is unloaded:

iounmap(io_base);
iounmap(io_base);

isa_readb 及其同类函数

isa_readb and Friends

查看内核源代码会发现另一组例程,其名称如 isa_readb。事实上,刚才描述的每个函数都有一个 isa_ 等效函数。这些函数提供对 ISA 内存的访问,而无需单独的 ioremap 步骤。然而,内核开发人员的说法是,这些函数旨在作为临时的驱动程序移植辅助工具,将来可能会消失。因此,您应该避免使用它们。

A look at the kernel source will turn up another set of routines with names such as isa_readb. In fact, each of the functions just described has an isa_ equivalent. These functions provide access to ISA memory without the need for a separate ioremap step. The word from the kernel developers, however, is that these functions are intended to be temporary driver-porting aids and that they may go away in the future. Therefore, you should avoid using them.

快速参考

Quick Reference

本章介绍了以下与硬件管理相关的符号:

This chapter introduced the following symbols related to hardware management:

#include <linux/kernel.h>

void barrier(void)
#include <linux/kernel.h>

void barrier(void)

这个“软件”内存屏障要求编译器将跨越该指令的所有内存视为易失性的。

This "software" memory barrier requests the compiler to consider all memory volatile across this instruction.

#include <asm/system.h>

void rmb(void);

void read_barrier_depends(void);

void wmb(void);

void mb(void);
#include <asm/system.h>

void rmb(void);

void read_barrier_depends(void);

void wmb(void);

void mb(void);

硬件内存屏障。它们要求 CPU(和编译器)在该指令处对所有内存读取、写入或两者设置检查点。

Hardware memory barriers. They request the CPU (and the compiler) to checkpoint all memory reads, writes, or both across this instruction.

#include <asm/io.h>

unsigned inb(unsigned port);

void outb(unsigned char byte, unsigned port);

unsigned inw(unsigned port);

void outw(unsigned short word, unsigned port);

unsigned inl(unsigned port);

void outl(unsigned doubleword, unsigned port);
#include <asm/io.h>

unsigned inb(unsigned port);

void outb(unsigned char byte, unsigned port);

unsigned inw(unsigned port);

void outw(unsigned short word, unsigned port);

unsigned inl(unsigned port);

void outl(unsigned doubleword, unsigned port);

用于读取和写入 I/O 端口的函数。只要用户空间程序具有访问端口的适当权限,它们也可以被用户空间程序调用。

Functions that are used to read and write I/O ports. They can also be called by user-space programs, provided they have the right privileges to access ports.

unsigned inb_p(unsigned port);

...
unsigned inb_p(unsigned port);

...

如果在 I/O 操作后需要较小的延迟,您可以使用上一条目中所介绍函数的六个暂停对应函数;这些暂停函数的名称以 _p 结尾。

If a small delay is needed after an I/O operation, you can use the six pausing counterparts of the functions introduced in the previous entry; these pausing functions have names ending in _p.

void insb(unsigned port, void *addr, unsigned long count);

void outsb(unsigned port, void *addr, unsigned long count);

void insw(unsigned port, void *addr, unsigned long count);

void outsw(unsigned port, void *addr, unsigned long count);

void insl(unsigned port, void *addr, unsigned long count);

void outsl(unsigned port, void *addr, unsigned long count);
void insb(unsigned port, void *addr, unsigned long count);

void outsb(unsigned port, void *addr, unsigned long count);

void insw(unsigned port, void *addr, unsigned long count);

void outsw(unsigned port, void *addr, unsigned long count);

void insl(unsigned port, void *addr, unsigned long count);

void outsl(unsigned port, void *addr, unsigned long count);

“字符串函数”经过优化,可以将数据从输入端口传输到内存区域,或反方向传输。此类传输通过对同一端口读取或写入 count 次来执行。

The "string functions" are optimized to transfer data from an input port to a region of memory, or the other way around. Such transfers are performed by reading or writing the same port count times.

#include <linux/ioport.h>

struct resource *request_region(unsigned long start, unsigned long len, char

*name);

void release_region(unsigned long start, unsigned long len);

int check_region(unsigned long start, unsigned long len);
#include <linux/ioport.h>

struct resource *request_region(unsigned long start, unsigned long len, char

*name);

void release_region(unsigned long start, unsigned long len);

int check_region(unsigned long start, unsigned long len);

I/O 端口的资源分配器。(已弃用的)check 函数成功时返回 0,出错时返回小于 0 的值。

Resource allocators for I/O ports. The (deprecated) check function returns 0 for success and less than 0 in case of error.

struct resource *request_mem_region(unsigned long start, unsigned long len,

char *name);

void release_mem_region(unsigned long start, unsigned long len);

int check_mem_region(unsigned long start, unsigned long len);
struct resource *request_mem_region(unsigned long start, unsigned long len,

char *name);

void release_mem_region(unsigned long start, unsigned long len);

int check_mem_region(unsigned long start, unsigned long len);

处理内存区域资源分配的函数。

Functions that handle resource allocation for memory regions.

#include <asm/io.h>

void *ioremap(unsigned long phys_addr, unsigned long size);

void *ioremap_nocache(unsigned long phys_addr, unsigned long size);

void iounmap(void *virt_addr);
#include <asm/io.h>

void *ioremap(unsigned long phys_addr, unsigned long size);

void *ioremap_nocache(unsigned long phys_addr, unsigned long size);

void iounmap(void *virt_addr);

ioremap 将物理地址范围重新映射到处理器的虚拟地址空间,使其可供内核使用。iounmap 在不再需要映射时释放该映射。

ioremap remaps a physical address range into the processor's virtual address space, making it available to the kernel. iounmap frees the mapping when it is no longer needed.

#include <asm/io.h>

unsigned int ioread8(void *addr);

unsigned int ioread16(void *addr);

unsigned int ioread32(void *addr);

void iowrite8(u8 value, void *addr);

void iowrite16(u16 value, void *addr);

void iowrite32(u32 value, void *addr);
#include <asm/io.h>

unsigned int ioread8(void *addr);

unsigned int ioread16(void *addr);

unsigned int ioread32(void *addr);

void iowrite8(u8 value, void *addr);

void iowrite16(u16 value, void *addr);

void iowrite32(u32 value, void *addr);

用于处理 I/O 内存的访问器函数。

Accessor functions that are used to work with I/O memory.

void ioread8_rep(void *addr, void *buf, unsigned long count);

void ioread16_rep(void *addr, void *buf, unsigned long count);

void ioread32_rep(void *addr, void *buf, unsigned long count);

void iowrite8_rep(void *addr, const void *buf, unsigned long count);

void iowrite16_rep(void *addr, const void *buf, unsigned long count);

void iowrite32_rep(void *addr, const void *buf, unsigned long count);
void ioread8_rep(void *addr, void *buf, unsigned long count);

void ioread16_rep(void *addr, void *buf, unsigned long count);

void ioread32_rep(void *addr, void *buf, unsigned long count);

void iowrite8_rep(void *addr, const void *buf, unsigned long count);

void iowrite16_rep(void *addr, const void *buf, unsigned long count);

void iowrite32_rep(void *addr, const void *buf, unsigned long count);

I/O 内存原语的“重复”版本。

"Repeating" versions of the I/O memory primitives.

unsigned readb(address);

unsigned readw(address);

unsigned readl(address);

void writeb(unsigned value, address);

void writew(unsigned value, address);

void writel(unsigned value, address);

memset_io(address, value, count);

memcpy_fromio(dest, source, nbytes);

memcpy_toio(dest, source, nbytes);
unsigned readb(address);

unsigned readw(address);

unsigned readl(address);

void writeb(unsigned value, address);

void writew(unsigned value, address);

void writel(unsigned value, address);

memset_io(address, value, count);

memcpy_fromio(dest, source, nbytes);

memcpy_toio(dest, source, nbytes);

用于访问 I/O 内存的较旧的、类型不安全的函数。

Older, type-unsafe functions for accessing I/O memory.

void *ioport_map(unsigned long port, unsigned int count);

void ioport_unmap(void *addr);
void *ioport_map(unsigned long port, unsigned int count);

void ioport_unmap(void *addr);

想要将 I/O 端口视为 I/O 内存的驱动程序作者可以将这些端口传递给 ioport_map。当不再需要该映射时,应使用 ioport_unmap 撤销它。

A driver author that wants to treat I/O ports as if they were I/O memory may pass those ports to ioport_map. The mapping should be done (with ioport_unmap) when no longer needed.




[ 1 ]并非所有计算机平台都使用读和写信号;有些有不同的方式来寻址外部电路。然而,这种差异在软件级别上是无关紧要的,为简化讨论,我们假设所有平台都有读和写信号。

[1] Not all computer platforms use a read and a write signal; some have different means to address external circuits. The difference is irrelevant at software level, however, and we'll assume all have read and write to simplify the discussion.

[ 2 ]有时,I/O 端口的排列方式类似于内存,并且您可以(例如)将两个 8 位写入绑定到单个 16 位操作中。例如,这适用于 PC 视频板。但一般来说,您不能指望此功能。

[2] Sometimes I/O ports are arranged like memory, and you can (for example) bind two 8-bit writes into a single 16-bit operation. This applies, for instance, to PC video boards. But generally, you can't count on this feature.

[ 3 ]从技术上讲,它必须具有 CAP_SYS_RAWIO 能力,但这与在大多数当前系统上以 root 身份运行相同。

[3] Technically, it must have the CAP_SYS_RAWIO capability, but that is the same as running as root on most current systems.

[ 4 ]单字节 I/O 并不像人们想象的那么重要,因为它是一种很少见的操作。要将单个字节读/写到任何地址空间,您需要实现一个数据路径,将寄存器集数据总线的低位连接到外部数据总线中的任何字节位置。这些数据路径需要额外的逻辑门,从而妨碍每次数据传输。放弃字节宽度的加载和存储可以有利于整体系统性能。

[4] Single-byte I/O is not as important as one may imagine, because it is a rare operation. To read/write a single byte to any address space, you need to implement a data path connecting the low bits of the register-set data bus to any byte position in the external data bus. These data paths require additional logic gates that get in the way of every data transfer. Dropping byte-wide loads and stores can benefit overall system performance.

[ 5 ]事实上,这并不完全正确。内存范围如此之小且使用如此频繁,以至于内核在引导时构建页表来访问这些地址。然而,用于访问它们的虚拟地址与物理地址不同,因此无论如何都需要ioremap 。

[5] Actually, this is not completely true. The memory range is so small and so frequently used that the kernel builds page tables at boot time to access those addresses. However, the virtual address used to access them is not the same as the physical address, and thus ioremap is needed anyway.

第 10 章中断处理

Chapter 10. Interrupt Handling

尽管有些设备只使用其 I/O 区域即可进行控制,但大多数实际设备都比这要复杂一些。设备必须与外部世界打交道,其中通常包括旋转磁盘、移动磁带、连接远方的电线等。许多工作必须在与处理器不同且慢得多的时间范围内完成。由于几乎总是不希望让处理器等待外部事件,因此设备必须有一种方法让处理器知道发生了什么事。

Although some devices can be controlled using nothing but their I/O regions, most real devices are a bit more complicated than that. Devices have to deal with the external world, which often includes things such as spinning disks, moving tape, wires to distant places, and so on. Much has to be done in a time frame that is different from, and far slower than, that of the processor. Since it is almost always undesirable to have the processor wait on external events, there must be a way for a device to let the processor know when something has happened.

当然,这种方式就是中断。中断只是硬件在需要处理器注意时可以发送的信号。Linux 处理中断的方式与处理用户空间信号的方式大致相同。大多数情况下,驱动程序只需为其设备的中断注册一个处理程序,并在中断到达时正确处理它们。当然,在这幅简单的图景之下隐藏着一些复杂性;特别是,由于中断处理程序的运行方式,它们可以执行的操作在某种程度上受到限制。

That way, of course, is interrupts. An interrupt is simply a signal that the hardware can send when it wants the processor's attention. Linux handles interrupts in much the same way that it handles signals in user space. For the most part, a driver need only register a handler for its device's interrupts, and handle them properly when they arrive. Of course, underneath that simple picture there is some complexity; in particular, interrupt handlers are somewhat limited in the actions they can perform as a result of how they are run.

如果没有真正的硬件设备来生成中断,则很难演示中断的使用。因此,本章中使用的示例代码适用于并行端口。此类端口在现代硬件上开始变得稀缺,但幸运的是,大多数人仍然能够使用具有可用端口的系统。我们将使用上一章中的 short 模块;通过一些小的添加,它可以生成并处理来自并行端口的中断。该模块的名称 short 实际上意味着 short int(这是 C 语言,不是吗?),以提醒我们它处理中断。

It is difficult to demonstrate the use of interrupts without a real hardware device to generate them. Thus, the sample code used in this chapter works with the parallel port. Such ports are starting to become scarce on modern hardware, but, with luck, most people are still able to get their hands on a system with an available port. We'll be working with the short module from the previous chapter; with some small additions it can generate and handle interrupts from the parallel port. The module's name, short, actually means short int (it is C, isn't it?), to remind us that it handles interrupts.

然而,在我们进入主题之前,是时候提出一个警告了。中断处理程序本质上是与其他代码同时运行的。因此,它们不可避免地会引发数据结构和硬件的并发和争用问题。如果您经不住诱惑而跳过第 5 章中的讨论,我们理解。但我们也建议您现在回头再看一下。在处理中断时,对并发控制技术的深入理解至关重要。

Before we get into the topic, however, it is time for one cautionary note. Interrupt handlers, by their nature, run concurrently with other code. Thus, they inevitably raise issues of concurrency and contention for data structures and hardware. If you succumbed to the temptation to pass over the discussion in Chapter 5, we understand. But we also recommend that you turn back and have another look now. A solid understanding of concurrency control techniques is vital when working with interrupts.
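举个简单的例子(假设性的草图,所有名称均为虚构):驱动程序的正常代码路径在访问与中断处理程序共享的数据时,通常会使用第 5 章介绍的自旋锁原语,并在持锁期间在本地禁用中断:

As a minimal, hypothetical sketch (all names invented): a driver's normal code paths typically use the spinlock primitives introduced in Chapter 5 when touching data shared with an interrupt handler, disabling interrupts locally while the lock is held:

```c
#include <linux/spinlock.h>

static spinlock_t mydev_lock = SPIN_LOCK_UNLOCKED;
static int mydev_pending;            /* shared with the interrupt handler */

void mydev_queue_work(void)
{
    unsigned long flags;

    /* disable interrupts locally while holding the lock, so the
     * handler cannot deadlock against us on this processor */
    spin_lock_irqsave(&mydev_lock, flags);
    mydev_pending++;
    spin_unlock_irqrestore(&mydev_lock, flags);
}
```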

准备并行端口

Preparing the Parallel Port

虽然并行接口很简单,但它可以触发中断。打印机使用此功能来通知 lp 驱动程序它已准备好接受缓冲区中的下一个字符。

Although the parallel interface is simple, it can trigger interrupts. This capability is used by the printer to notify the lp driver that it is ready to accept the next character in the buffer.

与大多数设备一样,并行端口在收到指示之前实际上并不生成中断;并行标准规定,设置端口 2(0x37a、0x27a 或其他)的位 4 可启用中断报告。short 在模块初始化时通过一个简单的 outb 调用来设置该位。

Like most devices, the parallel port doesn't actually generate interrupts before it's instructed to do so; the parallel standard states that setting bit 4 of port 2 (0x37a, 0x27a, or whatever) enables interrupt reporting. A simple outb call to set the bit is performed by short at module initialization.

一旦中断被使能,每当引脚10(所谓的ACK 位)处的电信号从低电平变为高电平时,并行接口就会生成中断。强制接口产生中断的最简单方法(无需将打印机连接到端口)是连接并行连接器的引脚 9 和 10。将一小段电线插入系统背面并行端口连接器的相应孔中即可创建此连接。并行端口的引脚排列如图9-1所示。

Once interrupts are enabled, the parallel interface generates an interrupt whenever the electrical signal at pin 10 (the so-called ACK bit) changes from low to high. The simplest way to force the interface to generate interrupts (short of hooking up a printer to the port) is to connect pins 9 and 10 of the parallel connector. A short length of wire inserted into the appropriate holes in the parallel port connector on the back of your system creates this connection. The pinout of the parallel port is shown in Figure 9-1.

引脚 9 是并行数据字节的最高有效位。如果将二进制数据写入 /dev/short0,则会生成多个中断。不过,将 ASCII 文本写入端口不会产生任何中断,因为 ASCII 字符集没有具有最高位集的条目。

Pin 9 is the most significant bit of the parallel data byte. If you write binary data to /dev/short0, you generate several interrupts. Writing ASCII text to the port won't generate any interrupts, though, because the ASCII character set has no entries with the top bit set.

如果您不想将引脚连接在一起,但手头有打印机,则可以使用真实打印机运行示例中断处理程序,如下所示。但是,请注意,我们引入的探测功能取决于引脚 9 和 10 之间的跳线是否就位,您需要它来使用我们的代码进行探测实验。

If you'd rather avoid wiring pins together, but you do have a printer at hand, you can run the sample interrupt handler using a real printer, as shown later. However, note that the probing functions we introduce depend on the jumper between pin 9 and 10 being in place, and you need it to experiment with probing using our code.

安装中断处理程序

Installing an Interrupt Handler

如果您想实际“查看”正在生成的中断,仅写入硬件设备是不够的;必须在系统中配置软件处理程序。如果 Linux 内核没有被告知等待您的中断,它只会确认并忽略它。

If you want to actually "see" interrupts being generated, writing to the hardware device isn't enough; a software handler must be configured in the system. If the Linux kernel hasn't been told to expect your interrupt, it simply acknowledges and ignores it.

中断线是一种宝贵且通常有限的资源,特别是当只有 15 或 16 条中断线时。内核保留了中断线的注册表,类似于 I/O 端口的注册表。模块应在使用中断通道(或 IRQ,用于中断请求)之前请求它,并在完成后释放它。在许多情况下,模块也应该能够与其他驱动程序共享中断线,正如我们将看到的。<linux/interrupt.h>中声明的以下函数实现了中断注册接口:

Interrupt lines are a precious and often limited resource, particularly when there are only 15 or 16 of them. The kernel keeps a registry of interrupt lines, similar to the registry of I/O ports. A module is expected to request an interrupt channel (or IRQ, for interrupt request) before using it and to release it when finished. In many situations, modules are also expected to be able to share interrupt lines with other drivers, as we will see. The following functions, declared in <linux/interrupt.h>, implement the interrupt registration interface:

int request_irq(unsigned int irq,
                irqreturn_t (*handler)(int, void *, struct pt_regs *),
                unsigned long flags, 
                const char *dev_name,
                void *dev_id);

void free_irq(unsigned int irq, void *dev_id);

像往常一样,从request_irq返回到请求函数的值要么0指示成功,要么指示负错误代码。-EBUSY该函数返回信号以表明另一个驱动程序已在使用所请求的中断线的情况并不罕见。函数的参数如下:

The value returned from request_irq to the requesting function is either 0 to indicate success or a negative error code, as usual. It's not uncommon for the function to return -EBUSY to signal that another driver is already using the requested interrupt line. The arguments to the functions are as follows:

unsigned int irq
unsigned int irq

正在请求的中断号。

The interrupt number being requested.

irqreturn_t (*handler)(int, void *, struct pt_regs *)
irqreturn_t (*handler)(int, void *, struct pt_regs *)

指向正在安装的处理函数的指针。我们将在本章后面讨论该函数的参数及其返回值。

The pointer to the handling function being installed. We discuss the arguments to this function and its return value later in this chapter.

unsigned long flags
unsigned long flags

正如您所期望的,与中断管理相关的选项位掩码(稍后描述)。

As you might expect, a bit mask of options (described later) related to interrupt management.

const char *dev_name
const char *dev_name

传递给request_irq 的字符串在/proc/interrupts中用于显示中断的所有者(请参阅下一节)。

The string passed to request_irq is used in /proc/interrupts to show the owner of the interrupt (see the next section).

void *dev_id
void *dev_id

用于共享中断线的指针。它是释放中断线时使用的唯一标识符,驱动程序也可以使用它来指向其自己的私有数据区域(以识别哪个设备正在中断)。如果中断不是共享的,dev_id可以设置为NULL,但无论如何使用此项来指向设备结构是一个好主意。dev_id我们将在第 10.3 节中看到 的实际用途。

Pointer used for shared interrupt lines. It is a unique identifier that is used when the interrupt line is freed and that may also be used by the driver to point to its own private data area (to identify which device is interrupting). If the interrupt is not shared, dev_id can be set to NULL, but it is a good idea anyway to use this item to point to the device structure. We'll see a practical use for dev_id in Section 10.3.

可以设置的位flags如下:

The bits that can be set in flags are as follows:

SA_INTERRUPT
SA_INTERRUPT

设置后,这表示“快速”中断处理程序。快速处理程序在当前处理器上禁用中断的情况下执行(该主题在第 10.2.3 节中介绍)。

When set, this indicates a "fast" interrupt handler. Fast handlers are executed with interrupts disabled on the current processor (the topic is covered in the Section 10.2.3).

SA_SHIRQ
SA_SHIRQ

该位表示中断可以在设备之间共享。第 10.5 节概述了共享的概念。

This bit signals that the interrupt can be shared between devices. The concept of sharing is outlined in Section 10.5.

SA_SAMPLE_RANDOM
SA_SAMPLE_RANDOM

该位指示生成的中断可以为 /dev/random 和 /dev/urandom 使用的熵池做出贡献。这些设备在读取时返回真正的随机数,旨在帮助应用程序软件选择安全密钥进行加密。这些随机数是从由各种随机事件贡献的熵池中提取的。如果您的设备在真正随机的时间生成中断,则应该设置此标志。另一方面,如果您的中断是可预测的(例如,图像采集卡的垂直消隐),则不值得设置该标志,因为无论如何它都不会对系统熵做出贡献。可能受到攻击者影响的设备不应设置此标志;例如,网络驱动程序可能会受到来自外部的可预测数据包时序的影响,因此不应该对熵池做出贡献。请参阅 drivers/char/random.c 中的注释以了解更多信息。

This bit indicates that the generated interrupts can contribute to the entropy pool used by /dev/random and /dev/urandom. These devices return truly random numbers when read and are designed to help application software choose secure keys for encryption. Such random numbers are extracted from an entropy pool that is contributed by various random events. If your device generates interrupts at truly random times, you should set this flag. If, on the other hand, your interrupts are predictable (for example, vertical blanking of a frame grabber), the flag is not worth setting—it wouldn't contribute to system entropy anyway. Devices that could be influenced by attackers should not set this flag; for example, network drivers can be subjected to predictable packet timing from outside and should not contribute to the entropy pool. See the comments in drivers/char/random.c for more information.

中断处理程序既可以在驱动程序初始化时安装,也可以在设备首次打开时安装。虽然从模块的初始化函数中安装中断处理程序听起来是个好主意,但通常不是,特别是如果您的设备不共享中断。由于中断线的数量是有限的,因此您不想浪费它们。您很容易就会发现计算机中的设备数量多于中断数量。如果模块在初始化时请求 IRQ,它将阻止任何其他驱动程序使用该中断,即使持有该中断的设备从未被使用过。另一方面,在设备打开时请求中断允许某些资源共享。

The interrupt handler can be installed either at driver initialization or when the device is first opened. Although installing the interrupt handler from within the module's initialization function might sound like a good idea, it often isn't, especially if your device does not share interrupts. Because the number of interrupt lines is limited, you don't want to waste them. You can easily end up with more devices in your computer than there are interrupts. If a module requests an IRQ at initialization, it prevents any other driver from using the interrupt, even if the device holding it is never used. Requesting the interrupt at device open, on the other hand, allows some sharing of resources.

例如,只要您不同时使用这两个设备,就可以在与调制解调器相同的中断上运行图像采集卡。用户在系统启动时加载特殊设备的模块是很常见的,即使该设备很少使用。数据采集小工具可能使用与第二个串行端口相同的中断。虽然在数据采集过程中避免连接到 Internet 服务提供商 (ISP) 并不难,但为了使用调制解调器而被迫卸载模块确实令人不愉快。

It is possible, for example, to run a frame grabber on the same interrupt as a modem, as long as you don't use the two devices at the same time. It is quite common for users to load the module for a special device at system boot, even if the device is rarely used. A data acquisition gadget might use the same interrupt as the second serial port. While it's not too hard to avoid connecting to your Internet service provider (ISP) during data acquisition, being forced to unload a module in order to use the modem is really unpleasant.

调用request_irq 的正确位置是在设备首次打开时,在指示硬件生成中断之前。调用free_irq 的地方是最后一次关闭设备时, 硬件被告知不要再中断处理器之后。此技术的缺点是您需要保留每个设备的打开计数,以便您知道何时可以禁用中断。

The correct place to call request_irq is when the device is first opened, before the hardware is instructed to generate interrupts. The place to call free_irq is the last time the device is closed, after the hardware is told not to interrupt the processor any more. The disadvantage of this technique is that you need to keep a per-device open count so that you know when interrupts can be disabled.
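下面是一个体现这种技术的假设性草图(设备名、变量名和 IRQ 号均为虚构):它在第一次 open 时请求中断,在最后一次 close 时释放它,并用一个信号量保护打开计数:

Here is a hypothetical sketch of that technique (device name, variable names, and IRQ number all invented): it requests the interrupt on the first open, frees it on the last close, and protects the open count with a semaphore:

```c
#include <linux/fs.h>
#include <linux/interrupt.h>
#include <asm/semaphore.h>

static DECLARE_MUTEX(mydev_sem);
static int mydev_count;              /* number of openers */
static int mydev_irq = 7;            /* hypothetical IRQ number */

irqreturn_t mydev_interrupt(int irq, void *dev_id, struct pt_regs *regs);

int mydev_open(struct inode *inode, struct file *filp)
{
    int result = 0;

    down(&mydev_sem);
    if (mydev_count++ == 0) {
        result = request_irq(mydev_irq, mydev_interrupt,
                             0, "mydev", NULL);
        if (result)
            mydev_count--;           /* request failed; undo the count */
    }
    up(&mydev_sem);
    return result;
}

int mydev_release(struct inode *inode, struct file *filp)
{
    down(&mydev_sem);
    if (--mydev_count == 0)
        free_irq(mydev_irq, NULL);
    up(&mydev_sem);
    return 0;
}
```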

尽管有上述讨论,short 还是在加载时请求其中断线。这样做是为了让您无需运行一个额外的进程来保持设备打开即可运行测试程序。因此,short 从其初始化函数 (short_init) 中请求中断,而不是像真正的设备驱动程序那样在 short_open 中进行。

This discussion notwithstanding, short requests its interrupt line at load time. This was done so that you can run the test programs without having to run an extra process to keep the device open. short, therefore, requests the interrupt from within its initialization function (short_init) instead of doing it in short_open, as a real device driver would.

以下代码请求的中断是 short_irq。该变量的实际赋值(即确定使用哪个 IRQ)稍后展示,因为它与当前的讨论无关。short_base 是正在使用的并行接口的基本 I/O 地址;写入接口的寄存器 2 以启用中断报告。

The interrupt requested by the following code is short_irq. The actual assignment of the variable (i.e., determining which IRQ to use) is shown later, since it is not relevant to the current discussion. short_base is the base I/O address of the parallel interface being used; register 2 of the interface is written to enable interrupt reporting.

if (short_irq >= 0) {
    result = request_irq(short_irq, short_interrupt,
            SA_INTERRUPT, "short", NULL);
   if (result) {
        printk(KERN_INFO "short: can't get assigned irq %i\n",
                short_irq);
        short_irq = -1;
    }
    else { /* actually enable it -- assume this *is* a parallel port */
        outb(0x10,short_base+2);
    }
}

代码显示,正在安装的处理程序是一个快速处理程序(SA_INTERRUPT),不支持中断共享(缺少SA_SHIRQ),并且不会对系统熵做出贡献(也缺少SA_SAMPLE_RANDOM)。然后,outb调用启用并行端口的中断报告。

The code shows that the handler being installed is a fast handler (SA_INTERRUPT), doesn't support interrupt sharing (SA_SHIRQ is missing), and doesn't contribute to system entropy (SA_SAMPLE_RANDOM is missing, too). The outb call then enables interrupt reporting for the parallel port.

无论如何,i386 和 x86_64 架构定义了一个用于查询中断线可用性的函数:

For what it's worth, the i386 and x86_64 architectures define a function for querying the availability of an interrupt line:

int can_request_irq(unsigned int irq, unsigned long flags);

如果尝试分配给定的中断会成功,则该函数返回非零值。但请注意,在调用can_request_irq和request_irq之间,情况总是可能发生变化。

This function returns a nonzero value if an attempt to allocate the given interrupt succeeds. Note, however, that things can always change between calls to can_request_irq and request_irq.

/proc 接口

The /proc Interface

每当硬件中断到达处理器时,内部计数器就会递增,从而提供一种检查设备是否按预期工作的方法。报告的中断显示在 /proc/interrupts中。以下快照是在双处理器 Pentium 系统上拍摄的:

Whenever a hardware interrupt reaches the processor, an internal counter is incremented, providing a way to check whether the device is working as expected. Reported interrupts are shown in /proc/interrupts. The following snapshot was taken on a two-processor Pentium system:

root@montalcino:/bike/corbet/write/ldd3/src/short# m /proc/interrupts
           CPU0       CPU1       
  0:    4848108         34    IO-APIC-edge  timer
  2:          0          0          XT-PIC  cascade
  8:          3          1    IO-APIC-edge  rtc
 10:       4335          1   IO-APIC-level  aic7xxx
 11:       8903          0   IO-APIC-level  uhci_hcd
 12:         49          1    IO-APIC-edge  i8042
NMI:          0          0 
LOC:    4848187    4848186 
ERR:          0
MIS:          0

第一列是 IRQ 号。您可以从缺少的 IRQ 中看到该文件仅显示与已安装处理程序相对应的中断。例如,第一个串行端口(使用中断号 4)未显示,表明调制解调器未被使用。事实上,即使调制解调器之前已使用过但在快照时并未使用,它也不会显示在文件中;串行端口表现良好,并在设备关闭时释放其中断处理程序。

The first column is the IRQ number. You can see from the IRQs that are missing that the file shows only interrupts corresponding to installed handlers. For example, the first serial port (which uses interrupt number 4) is not shown, indicating that the modem isn't being used. In fact, even if the modem had been used earlier but wasn't in use at the time of the snapshot, it would not show up in the file; the serial ports are well behaved and release their interrupt handlers when the device is closed.

/proc/interrupts显示了系统上每个CPU已传送的中断数量。从输出中可以看出,Linux内核通常在第一个CPU上处理中断,作为最大化缓存局部性的一种方式。[ 1 ]最后两列给出了处理该中断的可编程中断控制器的信息(驱动程序编写者不需要关心它),以及为该中断注册了处理程序的设备的名称(如request_irq的dev_name参数中所指定)。

The /proc/interrupts display shows how many interrupts have been delivered to each CPU on the system. As you can see from the output, the Linux kernel generally handles interrupts on the first CPU as a way of maximizing cache locality.[1] The last two columns give information on the programmable interrupt controller that handles the interrupt (and that a driver writer does not need to worry about), and the name(s) of the device(s) that have registered handlers for the interrupt (as specified in the dev_name argument to request_irq).

/proc树包含另一个与中断相关的文件/proc/stat;有时您会发现一个文件更有用,有时您会更喜欢另一个文件。 /proc/stat记录有关系统活动的一些低级统计信息,包括(但不限于)自系统启动以来收到的中断数量。统计数据的每一行都以一个文本字符串开头,该文本字符串是该行的关键;标记intr就是我们要寻找的东西。以下(截断的)快照是在上一张快照之后不久拍摄的:

The /proc tree contains another interrupt-related file, /proc/stat; sometimes you'll find one file more useful and sometimes you'll prefer the other. /proc/stat records several low-level statistics about system activity, including (but not limited to) the number of interrupts received since system boot. Each line of stat begins with a text string that is the key to the line; the intr mark is what we are looking for. The following (truncated) snapshot was taken shortly after the previous one:

intr 5167833 5154006 2 0 2 4907 0 2 68 4 0 4406 9291 50 0 0

第一个数字是所有中断的总数,而其他每个数字代表一条IRQ线,从中断0开始。所有计数都是系统中所有处理器的总和。此快照显示,即使当前没有安装处理程序,中断号4也已被使用4907次。如果您正在测试的驱动程序在每个打开和关闭周期获取并释放中断,您可能会发现/proc/stat比/proc/interrupts更有用。

The first number is the total of all interrupts, while each of the others represents a single IRQ line, starting with interrupt 0. All of the counts are summed across all processors in the system. This snapshot shows that interrupt number 4 has been used 4907 times, even though no handler is currently installed. If the driver you're testing acquires and releases the interrupt at each open and close cycle, you may find /proc/stat more useful than /proc/interrupts.

这两个文件之间的另一个区别是,interrupts不依赖于体系结构(也许除了末尾的几行),而stat则依赖;字段的数量取决于内核底层的硬件。可用中断的数量从SPARC上的15个到IA-64和其他一些系统上的256个不等。有趣的是,x86上定义的中断数量当前为224,而不是您可能期望的16;正如include/asm-i386/irq.h中所解释的,这取决于Linux使用体系结构限制而不是特定于实现的限制(例如老式PC中断控制器的16个中断源)。

Another difference between the two files is that interrupts is not architecture dependent (except, perhaps, for a couple of lines at the end), whereas stat is; the number of fields depends on the hardware underlying the kernel. The number of available interrupts varies from as few as 15 on the SPARC to as many as 256 on the IA-64 and a few other systems. It's interesting to note that the number of interrupts defined on the x86 is currently 224, not 16 as you may expect; this, as explained in include/asm-i386/irq.h, depends on Linux using the architectural limit instead of an implementation-specific limit (such as the 16 interrupt sources of the old-fashioned PC interrupt controller).

以下是 在 IA-64 系统上拍摄的/proc/interrupts的快照。正如您所看到的,除了常见中断源的硬件路由不同之外,输出与前面所示的 32 位系统非常相似。

The following is a snapshot of /proc/interrupts taken on an IA-64 system. As you can see, besides different hardware routing of common interrupt sources, the output is very similar to that from the 32-bit system shown earlier.

           CPU0       CPU1       
 27:       1705      34141  IO-SAPIC-level  qla1280
 40:          0          0           SAPIC  perfmon
 43:        913       6960  IO-SAPIC-level  eth0
 47:      26722        146  IO-SAPIC-level  usb-uhci
 64:          3          6   IO-SAPIC-edge  ide0
 80:          4          2   IO-SAPIC-edge  keyboard
 89:          0          0   IO-SAPIC-edge  PS/2 Mouse
239:    5606341    5606052           SAPIC  timer
254:      67575      52815           SAPIC  IPI
NMI:          0          0 
ERR:          0

自动检测IRQ号

Autodetecting the IRQ Number

对于驱动程序来说,初始化时最具挑战性的问题之一是如何确定设备将使用哪条 IRQ 线。驱动程序需要这些信息才能正确安装处理程序。即使程序员可以要求用户在加载时指定中断号,但这也是一个不好的做法,因为大多数时候用户不知道该号,要么是因为他没有配置跳线,要么是因为设备没有跳线。大多数用户希望他们的硬件“正常工作”,并且对中断号等问题不感兴趣。因此,自动检测中断号是驱动程序可用性的基本要求。

One of the most challenging problems for a driver at initialization time can be how to determine which IRQ line is going to be used by the device. The driver needs the information in order to correctly install the handler. Even though a programmer could require the user to specify the interrupt number at load time, this is a bad practice, because most of the time the user doesn't know the number, either because he didn't configure the jumpers or because the device is jumperless. Most users want their hardware to "just work" and are not interested in issues like interrupt numbers. So autodetection of the interrupt number is a basic requirement for driver usability.

有时,自动检测依赖于这样的知识:某些设备具有很少改变(如果曾经改变过)的默认行为。在这种情况下,驱动程序可以假设默认值适用。这正是short在默认情况下对并行端口的处理方式。实现很简单,short本身就展示了这一点:

Sometimes autodetection depends on the knowledge that some devices feature a default behavior that rarely, if ever, changes. In this case, the driver might assume that the default values apply. This is exactly how short behaves by default with the parallel port. The implementation is straightforward, as shown by short itself:

if (short_irq < 0) /* not yet specified: force the default on */
    switch(short_base) {
        case 0x378: short_irq = 7; break;
        case 0x278: short_irq = 2; break;
        case 0x3bc: short_irq = 5; break;
    }

代码根据所选的基本I/O地址来分配中断号,同时允许用户在加载时使用类似下面的方式覆盖默认值:

The code assigns the interrupt number according to the chosen base I/O address, while allowing the user to override the default at load time with something like:

insmod ./short.ko irq=x

short_base默认为0x378,因此short_irq默认为7

short_base defaults to 0x378, so short_irq defaults to 7.

有些设备在设计上更先进,只是“宣布”它们将使用哪个中断。在这种情况下,驱动程序通过从设备的 I/O 端口或 PCI 配置空间之一读取状态字节来检索中断号。当目标设备能够告诉驱动程序它将使用哪个中断时,自动检测 IRQ 号仅意味着探测该设备,而无需进行额外的工作来探测中断。幸运的是,大多数现代硬件都是这样工作的。例如,PCI 标准通过要求外围设备声明它们将使用什么中断线来解决这个问题。PCI 标准将在第 12 章中讨论。

Some devices are more advanced in design and simply "announce" which interrupt they're going to use. In this case, the driver retrieves the interrupt number by reading a status byte from one of the device's I/O ports or PCI configuration space. When the target device is one that has the ability to tell the driver which interrupt it is going to use, autodetecting the IRQ number just means probing the device, with no additional work required to probe the interrupt. Most modern hardware works this way, fortunately; for example, the PCI standard solves the problem by requiring peripheral devices to declare what interrupt line(s) they are going to use. The PCI standard is discussed in Chapter 12.

不幸的是,并不是每个设备都对程序员友好,并且自动检测可能需要一些探测。该技术非常简单:驱动程序告诉设备生成中断并观察发生了什么。如果一切顺利,则只有一根中断线被激活。

Unfortunately, not every device is programmer friendly, and autodetection might require some probing. The technique is quite simple: the driver tells the device to generate interrupts and watches what happens. If everything goes well, only one interrupt line is activated.

尽管探测在理论上很简单,但实际的实现可能尚不清楚。我们研究两种执行任务的方法:调用内核定义的辅助函数和实现我们自己的版本。

Although probing is simple in theory, the actual implementation might be unclear. We look at two ways to perform the task: calling kernel-defined helper functions and implementing our own version.

内核辅助探测

Kernel-assisted probing

Linux 内核提供了一个低级工具来探测中断号。它仅适用于非共享中断,但大多数能够在共享中断模式下工作的硬件都提供了更好的方法来查找配置的中断号。该设施由两个函数组成,在<linux/interrupt.h>中声明(它也描述了探测机制):

The Linux kernel offers a low-level facility for probing the interrupt number. It works for only nonshared interrupts, but most hardware that is capable of working in a shared interrupt mode provides better ways of finding the configured interrupt number anyway. The facility consists of two functions, declared in <linux/interrupt.h> (which also describes the probing machinery):

unsigned long probe_irq_on(void);

该函数返回未分配中断的位掩码。驱动程序必须保留返回的位掩码,并稍后将其传递给probe_irq_off。在此调用之后,驱动程序应安排其设备生成至少一个中断。

This function returns a bit mask of unassigned interrupts. The driver must preserve the returned bit mask, and pass it to probe_irq_off later. After this call, the driver should arrange for its device to generate at least one interrupt.

int probe_irq_off(unsigned long);

设备请求中断后,驱动程序调用此函数,并将先前由probe_irq_on返回的位掩码作为参数传递。probe_irq_off返回在“probe_on”之后发出的中断的编号。如果没有发生中断,则返回0(因此无法探测IRQ 0,但无论如何,在任何受支持的体系结构上都没有自定义设备可以使用它)。如果发生了多个中断(检测不明确),probe_irq_off将返回负值。

After the device has requested an interrupt, the driver calls this function, passing as its argument the bit mask previously returned by probe_irq_on. probe_irq_off returns the number of the interrupt that was issued after "probe_on." If no interrupts occurred, 0 is returned (therefore, IRQ 0 can't be probed for, but no custom device can use it on any of the supported architectures anyway). If more than one interrupt occurred (ambiguous detection), probe_irq_off returns a negative value.

程序员应该小心地 在调用probe_irq_on之后启用设备上的中断,并在调用probe_irq_off之前禁用它们。此外,您必须记住在probe_irq_off之后服务设备中的挂起中断 。

The programmer should be careful to enable interrupts on the device after the call to probe_irq_on and to disable them before calling probe_irq_off. Additionally, you must remember to service the pending interrupt in your device after probe_irq_off.

short模块演示了如何使用这种探测。如果使用probe=1加载该模块,则执行以下代码来检测中断线,前提是并行连接器的引脚9和10连接在一起:

The short module demonstrates how to use such probing. If you load the module with probe=1, the following code is executed to detect your interrupt line, provided pins 9 and 10 of the parallel connector are bound together:

int count = 0;
do {
    unsigned long mask;

    mask = probe_irq_on();
    outb_p(0x10,short_base+2); /* enable reporting */
    outb_p(0x00,short_base);   /* clear the bit */
    outb_p(0xFF,short_base);   /* set the bit: interrupt! */
    outb_p(0x00,short_base+2); /* disable reporting */
    udelay(5);  /* give it some time */
    short_irq = probe_irq_off(mask);

    if (short_irq == 0) { /* none of them? */
        printk(KERN_INFO "short: no irq reported by probe\n");
        short_irq = -1;
    }
    /*
     * if more than one line has been activated, the result is
     * negative. We should service the interrupt (no need for lpt port)
     * and loop over again. Loop at most five times, then give up
     */
} while (short_irq < 0 && count++ < 5);
if (short_irq < 0)
    printk("short: probe failed %i times, giving up\n", count);

请注意在调用probe_irq_off之前使用了udelay。根据处理器的速度,您可能需要等待一小段时间,以便给中断留出真正被传递的时间。

Note the use of udelay before calling probe_irq_off. Depending on the speed of your processor, you may have to wait for a brief period to give the interrupt time to actually be delivered.

探测可能是一项漫长的任务。虽然short的情况并非如此,但例如探测图像采集卡就需要至少20毫秒的延迟(这对处理器来说是很长的时间),而其他设备可能需要更长的时间。因此,最好在模块初始化时只探测一次中断线,无论您是在设备打开时(如您应该做的那样)还是在初始化函数中(不推荐)安装处理程序。

Probing might be a lengthy task. While this is not true for short, probing a frame grabber, for example, requires a delay of at least 20 ms (which is ages for the processor), and other devices might take even longer. Therefore, it's best to probe for the interrupt line only once, at module initialization, independently of whether you install the handler at device open (as you should) or within the initialization function (which is not recommended).

有趣的是,在某些平台(PowerPC、M68k、大多数 MIPS 实现以及两个 SPARC 版本)上,探测是不必要的,因此,前面的函数只是空占位符,有时被称为“无用的 ISA 废话”。在其他平台上,探测仅针对 ISA 设备实现。无论如何,大多数体系结构都定义了函数(即使它们是空的)以简化现有设备驱动程序的移植。

It's interesting to note that on some platforms (PowerPC, M68k, most MIPS implementations, and both SPARC versions) probing is unnecessary, and, therefore, the previous functions are just empty placeholders, sometimes called "useless ISA nonsense." On other platforms, probing is implemented only for ISA devices. Anyway, most architectures define the functions (even if they are empty) to ease porting existing device drivers.

自己动手探测

Do-it-yourself probing

探测也可以在驱动程序本身中实现,而不会有太多麻烦。必须实现自己的探测的驱动程序很少见,但了解它的工作原理可以让您深入了解该过程。为此,如果以probe=2加载short模块,它将自行检测IRQ线。

Probing can also be implemented in the driver itself without too much trouble. It is a rare driver that must implement its own probing, but seeing how it works gives some insight into the process. To that end, the short module performs do-it-yourself detection of the IRQ line if it is loaded with probe=2.

该机制与前面描述的机制相同:启用所有未使用的中断,然后等待,看看会发生什么。然而,我们可以利用我们对该设备的了解。通常,设备可以配置为使用一组 3 个或 4 个 IRQ 编号中的一个;仅探测这些 IRQ 使我们能够检测到正确的 IRQ,而无需测试所有可能的 IRQ。

The mechanism is the same as the one described earlier: enable all unused interrupts, then wait and see what happens. We can, however, exploit our knowledge of the device. Often a device can be configured to use one IRQ number from a set of three or four; probing just those IRQs enables us to detect the right one, without having to test for all possible IRQs.

short的实现假设3、5、7和9是唯一可能的IRQ值。这些数字实际上是一些并行设备允许您选择的值。

The short implementation assumes that 3, 5, 7, and 9 are the only possible IRQ values. These numbers are actually the values that some parallel devices allow you to select.

以下代码通过测试所有“可能的”中断并观察发生的情况来进行探测。trials数组列出了要尝试的IRQ,并以0作为结束标记;tried数组用于跟踪该驱动程序实际注册了哪些处理程序。

The following code probes by testing all "possible" interrupts and looking at what happens. The trials array lists the IRQs to try and has 0 as the end marker; the tried array is used to keep track of which handlers have actually been registered by this driver.

int trials[  ] = {3, 5, 7, 9, 0};
int tried[  ]  = {0, 0, 0, 0, 0};
int i, count = 0;

/*
 * install the probing handler for all possible lines. Remember
 * the result (0 for success, or -EBUSY) in order to only free
 * what has been acquired
 */
for (i = 0; trials[i]; i++)
    tried[i] = request_irq(trials[i], short_probing,
            SA_INTERRUPT, "short probe", NULL);

do {
    short_irq = 0; /* none got, yet */
    outb_p(0x10,short_base+2); /* enable */
    outb_p(0x00,short_base);
    outb_p(0xFF,short_base); /* toggle the bit */
    outb_p(0x00,short_base+2); /* disable */
    udelay(5);  /* give it some time */

    /* the value has been set by the handler */
    if (short_irq == 0) { /* none of them? */
        printk(KERN_INFO "short: no irq reported by probe\n");
    }
    /*
     * If more than one line has been activated, the result is
     * negative. We should service the interrupt (but the lpt port
     * doesn't need it) and loop over again. Do it at most 5 times
     */
} while (short_irq <=0 && count++ < 5);

/* end of loop, uninstall the handler */
for (i = 0; trials[i]; i++)
    if (tried[i] == 0)
        free_irq(trials[i], NULL);

if (short_irq < 0)
    printk("short: probe failed %i times, giving up\n", count);

您可能事先不知道“可能的”IRQ值是什么。在这种情况下,您需要探测所有空闲中断,而不是将自己限制在少数几个trials[ ]中。要探测所有中断,您必须从IRQ 0探测到IRQ NR_IRQS-1,其中NR_IRQS在<asm/irq.h>中定义并且与平台相关。

You might not know in advance what the "possible" IRQ values are. In that case, you need to probe all the free interrupts, instead of limiting yourself to a few trials[ ]. To probe for all interrupts, you have to probe from IRQ 0 to IRQ NR_IRQS-1, where NR_IRQS is defined in <asm/irq.h> and is platform dependent.

现在我们只缺少探测处理程序本身。处理程序的作用是根据实际接收到的中断来更新short_irq。short_irq的值0表示“还没有”,而负值表示“不明确”。选择这些值是为了与probe_irq_off保持一致,并允许short.c中的同一段代码调用任一种探测方式。

Now we are missing only the probing handler itself. The handler's role is to update short_irq according to which interrupts are actually received. A 0 value in short_irq means "nothing yet," while a negative value means "ambiguous." These values were chosen to be consistent with probe_irq_off and to allow the same code to call either kind of probing within short.c.

irqreturn_t short_probing(int irq, void *dev_id, struct pt_regs *regs)
{
    if (short_irq == 0) short_irq = irq;    /* found */
    if (short_irq != irq) short_irq = -irq; /* ambiguous */
    return IRQ_HANDLED;
}

处理程序的参数将在稍后描述。知道irq是正在处理的中断,应该足以理解刚刚展示的函数。

The arguments to the handler are described later. Knowing that irq is the interrupt being handled should be sufficient to understand the function just shown.

快速和慢速处理程序

Fast and Slow Handlers

旧版本的 Linux 内核煞费苦心地区分“快”中断和“慢”中断。快速中断是那些可以非常快地处理的中断,而处理慢速中断则需要更长的时间。慢速中断对处理器的要求可能很高,因此在处理中断时重新启用中断是值得的。否则,需要快速集中注意力的任务可能会被拖延太久。

Older versions of the Linux kernel took great pains to distinguish between "fast" and "slow" interrupts. Fast interrupts were those that could be handled very quickly, whereas handling slow interrupts took significantly longer. Slow interrupts could be sufficiently demanding of the processor that it was worthwhile to reenable interrupts while they were being handled. Otherwise, tasks requiring quick attention could be delayed for too long.

在现代内核中,快速中断和慢速中断之间的大部分差异已经消失。只剩下一个差异:快速中断(使用SA_INTERRUPT标志请求的中断)在当前处理器上禁用所有其他中断的情况下执行。请注意,其他处理器仍然可以处理中断,尽管您永远不会看到两个处理器同时处理同一个IRQ。

In modern kernels, most of the differences between fast and slow interrupts have disappeared. There remains only one: fast interrupts (those that were requested with the SA_INTERRUPT flag) are executed with all other interrupts disabled on the current processor. Note that other processors can still handle interrupts, although you will never see two processors handling the same IRQ at the same time.

那么,您的驱动程序应该使用哪种类型的中断?在现代系统上,SA_INTERRUPT仅用于少数特定情况,例如定时器中断。除非您有充分的理由在禁用其他中断的情况下运行中断处理程序,否则不应使用SA_INTERRUPT。

So, which type of interrupt should your driver use? On modern systems, SA_INTERRUPT is intended only for use in a few, specific situations such as timer interrupts. Unless you have a strong reason to run your interrupt handler with other interrupts disabled, you should not use SA_INTERRUPT.

此描述应该能让大多数读者满意,尽管对硬件有品味并且有一定计算机使用经验的人可能有兴趣更深入地了解。如果您不关心内部细节,可以跳到下一节。

This description should satisfy most readers, although someone with a taste for hardware and some experience with her computer might be interested in going deeper. If you don't care about the internal details, you can skip to the next section.

x86 上中断处理的内部结构

The internals of interrupt handling on the x86

此描述是根据2.6内核中的arch/i386/kernel/irq.c、arch/i386/kernel/apic.c、arch/i386/kernel/entry.S、arch/i386/kernel/i8259.c和include/asm-i386/hw_irq.h推断出来的;尽管一般概念保持不变,但其他平台上的硬件细节有所不同。

This description has been extrapolated from arch/i386/kernel/irq.c, arch/i386/kernel/apic.c, arch/i386/kernel/entry.S, arch/i386/kernel/i8259.c, and include/asm-i386/hw_irq.h as they appear in the 2.6 kernels; although the general concepts remain the same, the hardware details differ on other platforms.

最低级别的中断处理可以在entry.S中找到,这是一个处理大部分机器级工作的汇编语言文件。通过一些汇编器技巧和一些宏,一段代码被分配给每个可能的中断。在每种情况下,代码都会将中断号压入堆栈并跳转到公共段,该段调用irq.c中定义的 do_IRQ

The lowest level of interrupt handling can be found in entry.S, an assembly-language file that handles much of the machine-level work. By way of a bit of assembler trickery and some macros, a bit of code is assigned to every possible interrupt. In each case, the code pushes the interrupt number on the stack and jumps to a common segment, which calls do_IRQ, defined in irq.c.

do_IRQ做的第一件事是确认中断,以便中断控制器可以继续处理其他事情。然后,它获得给定IRQ编号的自旋锁,从而防止任何其他CPU处理该IRQ。它清除几个状态位(包括一个名为IRQ_WAITING的状态位,我们很快会看到它),然后查找该特定IRQ的处理程序。如果没有处理程序,则无事可做;自旋锁被释放,所有待处理的软件中断都被处理,然后do_IRQ返回。

The first thing do_IRQ does is to acknowledge the interrupt so that the interrupt controller can go on to other things. It then obtains a spinlock for the given IRQ number, thus preventing any other CPU from handling this IRQ. It clears a couple of status bits (including one called IRQ_WAITING that we'll look at shortly) and then looks up the handler(s) for this particular IRQ. If there is no handler, there's nothing to do; the spinlock is released, any pending software interrupts are handled, and do_IRQ returns.

然而,通常情况下,如果设备正在中断,则至少也会为其 IRQ 注册一个处理程序。调用函数handle_IRQ_event来实际调用处理程序。如果处理程序属于慢速类型(SA_INTERRUPT未设置),则在硬件中重新启用中断,并调用处理程序。然后只需进行清理、运行软件中断并返回正常工作即可。“常规工作”很可能因中断而发生变化(例如,处理程序可以唤醒进程),因此从中断返回时发生的最后一件事可能是处理器的重新调度。

Usually, however, if a device is interrupting, there is at least one handler registered for its IRQ as well. The function handle_IRQ_event is called to actually invoke the handlers. If the handler is of the slow variety (SA_INTERRUPT is not set), interrupts are reenabled in the hardware, and the handler is invoked. Then it's just a matter of cleaning up, running software interrupts, and getting back to regular work. The "regular work" may well have changed as a result of an interrupt (the handler could wake_up a process, for example), so the last thing that happens on return from an interrupt is a possible rescheduling of the processor.

IRQ的探测是通过为当前缺少处理程序的每个IRQ设置IRQ_WAITING状态位来完成的。当中断发生时,do_IRQ清除该位然后返回,因为没有注册处理程序。当驱动程序调用probe_irq_off时,它只需要搜索不再设置IRQ_WAITING的那个IRQ。

Probing for IRQs is done by setting the IRQ_WAITING status bit for each IRQ that currently lacks a handler. When the interrupt happens, do_IRQ clears that bit and then returns, because no handler is registered. probe_irq_off, when called by a driver, needs to search for only the IRQ that no longer has IRQ_WAITING set.

实现处理程序

Implementing a Handler

到目前为止,我们已经学会了注册中断处理程序,但还没有学会编写一个。实际上,处理程序并没有什么特别之处:它就是普通的C代码。

So far, we've learned to register an interrupt handler but not to write one. Actually, there's nothing unusual about a handler—it's ordinary C code.

唯一的特殊之处在于处理程序在中断时间运行,因此它能做的事情受到一些限制。这些限制与我们在内核定时器中看到的限制相同。处理程序无法向用户空间传输数据或从用户空间传输数据,因为它不在进程的上下文中执行。处理程序也不能执行任何会休眠的操作,例如调用wait_event、使用GFP_ATOMIC以外的任何方式分配内存或锁定信号量。最后,处理程序不能调用schedule。

The only peculiarity is that a handler runs at interrupt time and, therefore, suffers some restrictions on what it can do. These restrictions are the same as those we saw with kernel timers. A handler can't transfer data to or from user space, because it doesn't execute in the context of a process. Handlers also cannot do anything that would sleep, such as calling wait_event, allocating memory with anything other than GFP_ATOMIC, or locking a semaphore. Finally, handlers cannot call schedule.

中断处理程序的作用是向其设备反馈中断已被接收,并根据所服务中断的含义读取或写入数据。第一步通常是清除接口板上的一个位;大多数硬件设备在其“中断挂起”位被清除之前不会产生其他中断。根据您的硬件的工作方式,此步骤可能需要放在最后而不是最先执行;这里没有通用的规则。有些设备不需要此步骤,因为它们没有“中断挂起”位;此类设备占少数,尽管并行端口是其中之一。因此,short不必清除这样的位。

The role of an interrupt handler is to give feedback to its device about interrupt reception and to read or write data according to the meaning of the interrupt being serviced. The first step usually consists of clearing a bit on the interface board; most hardware devices won't generate other interrupts until their "interrupt-pending" bit has been cleared. Depending on how your hardware works, this step may need to be performed last instead of first; there is no catch-all rule here. Some devices don't require this step, because they don't have an "interrupt-pending" bit; such devices are a minority, although the parallel port is one of them. For that reason, short does not have to clear such a bit.

中断处理程序的典型任务是,如果中断发出信号通知进程正在等待的事件(例如新数据的到达),则唤醒在设备上休眠的进程。

A typical task for an interrupt handler is awakening processes sleeping on the device if the interrupt signals the event they're waiting for, such as the arrival of new data.

继续以图像采集卡为例,进程可以通过连续读取设备来获取图像序列;读取调用会在读取每个帧之前阻塞,而中断处理程序在每个新帧到达时立即唤醒进程。这假设采集器中断处理器以发出每个新帧成功到达的信号。

To stick with the frame grabber example, a process could acquire a sequence of images by continuously reading the device; the read call blocks before reading each frame, while the interrupt handler awakens the process as soon as each new frame arrives. This assumes that the grabber interrupts the processor to signal successful arrival of each new frame.

程序员应该小心地编写一个在最短的时间内执行的例程,无论它是快速还是慢速处理程序。如果需要执行较长的计算,最好的方法是使用 tasklet 或工作队列在更安全的时间安排计算(我们将在第 10.4 节中了解如何以这种方式推迟工作。)

The programmer should be careful to write a routine that executes in a minimum amount of time, independent of its being a fast or slow handler. If a long computation needs to be performed, the best approach is to use a tasklet or workqueue to schedule computation at a safer time (we'll look at how work can be deferred in this manner in Section 10.4.)

short中的示例代码通过调用do_gettimeofday并将当前时间打印到一个页面大小的循环缓冲区中来响应中断。然后它会唤醒所有正在读取的进程,因为现在有数据可供读取。

Our sample code in short responds to the interrupt by calling do_gettimeofday and printing the current time into a page-sized circular buffer. It then awakens any reading process, because there is now data available to be read.

irqreturn_t short_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    struct timeval tv;
    int written;

    do_gettimeofday(&tv);

        /* Write a 16 byte record. Assume PAGE_SIZE is a multiple of 16 */
    written = sprintf((char *)short_head,"%08u.%06u\n",
            (int)(tv.tv_sec % 100000000), (int)(tv.tv_usec));
    BUG_ON(written != 16);
    short_incr_bp(&short_head, written);
    wake_up_interruptible(&short_queue); /* awake any reading process */
    return IRQ_HANDLED;
}

这段代码虽然简单,但代表了中断处理程序的典型工作。它接着调用short_incr_bp,其定义如下:

This code, though simple, represents the typical job of an interrupt handler. It, in turn, calls short_incr_bp, which is defined as follows:

static inline void short_incr_bp(volatile unsigned long *index, int delta)
{
    unsigned long new = *index + delta;
    barrier();  /* Don't optimize these two together */
    *index = (new >= (short_buffer + PAGE_SIZE)) ? short_buffer : new;
}

该函数经过精心编写,可以在循环缓冲区中回绕指针,而不会暴露不正确的值。barrier调用是为了阻止编译器跨函数的另外两行进行优化。如果没有屏障,编译器可能会决定优化掉new变量并直接赋值给*index。在回绕的情况下,该优化可能会在短时间内暴露索引的错误值。通过注意防止不一致的值对其他线程可见,我们可以在不加锁的情况下安全地操作循环缓冲区指针。

This function has been carefully written to wrap a pointer into the circular buffer without ever exposing an incorrect value. The barrier call is there to block compiler optimizations across the other two lines of the function. Without the barrier, the compiler might decide to optimize out the new variable and assign directly to *index. That optimization could expose an incorrect value of the index for a brief period in the case where it wraps. By taking care to prevent an inconsistent value from ever being visible to other threads, we can manipulate the circular buffer pointers safely without locks.

用于读取在中断时填充的缓冲区的设备文件是/dev/shortint。这个设备特殊文件与/dev/shortprint一起,没有在第9章中介绍,因为它们的使用特定于中断处理。/dev/shortint的内部结构是专门为中断的生成和报告而定制的。向设备写入时,每隔一个字节就会产生一个中断;读取设备则会给出每个中断被报告的时间。

The device file used to read the buffer being filled at interrupt time is /dev/shortint. This device special file, together with /dev/shortprint, wasn't introduced in Chapter 9, because its use is specific to interrupt handling. The internals of /dev/shortint are specifically tailored for interrupt generation and reporting. Writing to the device generates one interrupt every other byte; reading the device gives the time when each interrupt was reported.

如果将并行连接器的引脚 9 和 10 连接在一起,则可以通过提高并行数据字节的高位来生成中断。这可以通过将二进制数据写入/dev/short0或将任何内容写入/dev/shortint来完成。[ 2 ]

If you connect together pins 9 and 10 of the parallel connector, you can generate interrupts by raising the high bit of the parallel data byte. This can be accomplished by writing binary data to /dev/short0 or by writing anything to /dev/shortint.[2]

以下代码实现了 /dev/shortint 的 read 和 write:

The following code implements read and write for /dev/shortint:

ssize_t short_i_read (struct file *filp, char _ _user *buf, size_t count, 
     loff_t *f_pos)
{
    int count0;
    DEFINE_WAIT(wait);

    while (short_head == short_tail) {
        prepare_to_wait(&short_queue, &wait, TASK_INTERRUPTIBLE);
        if (short_head == short_tail)
            schedule(  );
        finish_wait(&short_queue, &wait);
        if (signal_pending (current))  /* a signal arrived */
            return -ERESTARTSYS; /* tell the fs layer to handle it */
    } 
    /* count0 is the number of readable data bytes */
    count0 = short_head - short_tail;
    if (count0 < 0) /* wrapped */
        count0 = short_buffer + PAGE_SIZE - short_tail;
    if (count0 < count) count = count0;

    if (copy_to_user(buf, (char *)short_tail, count))
        return -EFAULT;
    short_incr_bp (&short_tail, count);
    return count;
}

ssize_t short_i_write (struct file *filp, const char _ _user *buf, size_t count,
        loff_t *f_pos)
{
    int written = 0, odd = *f_pos & 1;
    unsigned long port = short_base; /* output to the parallel data latch */
    void *address = (void *) short_base;

    if (use_mem) {
        while (written < count)
            iowrite8(0xff * ((++written + odd) & 1), address);
    } else {
        while (written < count)
            outb(0xff * ((++written + odd) & 1), port);
    }

    *f_pos += count;
    return written;
}

另一个设备特殊文件 /dev/shortprint 使用并行端口驱动打印机;如果您想避免连接 D-25 连接器的引脚 9 和 10,则可以使用它。shortprint 的写入实现使用循环缓冲区来存储要打印的数据,而读取实现就是刚刚展示的那个(因此您可以读到打印机消化每个字符所花的时间)。

The other device special file, /dev/shortprint, uses the parallel port to drive a printer; you can use it if you want to avoid connecting pins 9 and 10 of a D-25 connector. The write implementation of shortprint uses a circular buffer to store data to be printed, while the read implementation is the one just shown (so you can read the time your printer takes to eat each character).

为了支持打印机操作,中断处理程序在刚刚所示版本的基础上进行了轻微修改,增加了在还有更多数据要传输时将下一个数据字节发送到打印机的功能。

In order to support printer operation, the interrupt handler has been slightly modified from the one just shown, adding the ability to send the next data byte to the printer if there is more data to transfer.

处理程序参数和返回值

Handler Arguments and Return Value

虽然 short 忽略了它们,但有三个参数会传递给中断处理程序:irq、dev_id 和 regs。让我们看看每个参数的作用。

Though short ignores them, three arguments are passed to an interrupt handler: irq, dev_id, and regs. Let's look at the role of each.

中断号 ( int irq) 非常有用,因为您可以在日志消息(如果有)中打印信息。第二个参数void *dev_id是一种客户端数据;一个void *参数被传递给 request_irq,然后当中断发生时,这个相同的指针将作为参数传递回处理程序。您通常会在 中传递指向设备数据结构的指针dev_id,因此管理同一设备的多个实例的驱动程序不需要在中断处理程序中添加任何额外代码来找出哪个设备负责当前中断事件。

The interrupt number (int irq) is useful as information you may print in your log messages, if any. The second argument, void *dev_id, is a sort of client data; a void * argument is passed to request_irq, and this same pointer is then passed back as an argument to the handler when the interrupt happens. You usually pass a pointer to your device data structure in dev_id, so a driver that manages several instances of the same device doesn't need any extra code in the interrupt handler to find out which device is in charge of the current interrupt event.

中断处理程序中参数的典型用法如下:

Typical use of the argument in an interrupt handler is as follows:

static irqreturn_t sample_interrupt(int irq, void *dev_id, struct pt_regs 
                             *regs)
{
    struct sample_dev *dev = dev_id;

    /* now `dev' points to the right hardware item */
    /* .... */
}

与此处理程序关联的典型 open 方法代码如下所示:

The typical open code associated with this handler looks like this:

static int sample_open(struct inode *inode, struct file *filp)
{
    struct sample_dev *dev = hwinfo + MINOR(inode->i_rdev);
    request_irq(dev->irq, sample_interrupt,
                0 /* flags */, "sample", dev /* dev_id */);
    /*....*/
    return 0;
}

最后一个参数 struct pt_regs *regs 很少使用。它保存处理器进入中断代码之前的处理器上下文快照。这些寄存器可用于监控和调试;常规的设备驱动程序任务通常不需要它们。

The last argument, struct pt_regs *regs, is rarely used. It holds a snapshot of the processor's context before the processor entered interrupt code. The registers can be used for monitoring and debugging; they are not normally needed for regular device driver tasks.

中断处理程序应返回一个值,指示是否确实有需要处理的中断。如果处理程序发现其设备确实需要关注,则应返回 IRQ_HANDLED;否则返回值应该是 IRQ_NONE。您还可以使用下面这个宏生成返回值:

Interrupt handlers should return a value indicating whether there was actually an interrupt to handle. If the handler found that its device did, indeed, need attention, it should return IRQ_HANDLED; otherwise the return value should be IRQ_NONE. You can also generate the return value with this macro:

IRQ_RETVAL(handled)

其中,如果您能够处理该中断,handled 为非零。内核使用返回值来检测和抑制虚假中断。如果您的设备没有办法让您判断它是否真的产生了中断,您应该返回 IRQ_HANDLED。

where handled is nonzero if you were able to handle the interrupt. The return value is used by the kernel to detect and suppress spurious interrupts. If your device gives you no way to tell whether it really interrupted, you should return IRQ_HANDLED.

启用和禁用中断

Enabling and Disabling Interrupts

有时,一个 设备驱动程序必须在一段(希望很短的)时间内阻止中断的传送。通常,必须在保持自旋锁的同时阻止中断,以避免系统死锁。有多种方法可以禁用不涉及自旋锁的中断。但在我们讨论它们之前,请注意,即使在设备驱动程序中,禁用中断也应该是一项相对罕见的活动,并且这种技术永远不应该用作驱动程序中的互斥机制。

There are times when a device driver must block the delivery of interrupts for a (hopefully short) period of time. Often, interrupts must be blocked while holding a spinlock to avoid deadlocking the system. There are ways of disabling interrupts that do not involve spinlocks. But before we discuss them, note that disabling interrupts should be a relatively rare activity, even in device drivers, and this technique should never be used as a mutual exclusion mechanism within a driver.

禁用单个中断

Disabling a single interrupt

有时(但很少!)驱动程序需要禁用特定中断线的中断传递。内核为此提供了三个函数,全部在 <asm/irq.h> 中声明。这些函数是内核 API 的一部分,因此我们在此描述它们,但在大多数驱动程序中不鼓励使用。除此之外,您无法禁用共享中断线,而在现代系统上,共享中断是常态。尽管如此,这些函数如下:

Sometimes (but rarely!) a driver needs to disable interrupt delivery for a specific interrupt line. The kernel offers three functions for this purpose, all declared in <asm/irq.h>. These functions are part of the kernel API, so we describe them, but their use is discouraged in most drivers. Among other things, you cannot disable shared interrupt lines, and, on modern systems, shared interrupts are the norm. That said, here they are:

void disable_irq(int irq);
void disable_irq_nosync(int irq);
void enable_irq(int irq);

调用这些函数中的任何一个,都可能会更新可编程中断控制器 (PIC) 中指定 irq 的掩码,从而在所有处理器上禁用或启用指定的 IRQ。对这些函数的调用可以嵌套:如果连续调用两次 disable_irq,则需要两次 enable_irq 调用才能真正重新启用该 IRQ。可以从中断处理程序中调用这些函数,但在处理某个 IRQ 期间启用它通常不是好的做法。

Calling any of these functions may update the mask for the specified irq in the programmable interrupt controller (PIC), thus disabling or enabling the specified IRQ across all processors. Calls to these functions can be nested—if disable_irq is called twice in succession, two enable_irq calls are required before the IRQ is truly reenabled. It is possible to call these functions from an interrupt handler, but enabling your own IRQ while handling it is not usually good practice.

disable_irq不仅禁用给定的中断,而且还等待当前正在执行的中断处理程序(如果有)完成。请注意,如果调用disable_irq的线程持有中断处理程序所需的任何资源(例如自旋锁),系统可能会死锁。 disable_irq_nosyncdisable_irq 的不同 之处在于它立即返回。因此,使用disable_irq_nosync会快一点,但可能会让你的驱动程序面临竞争条件。

disable_irq not only disables the given interrupt but also waits for a currently executing interrupt handler, if any, to complete. Be aware that if the thread calling disable_irq holds any resources (such as spinlocks) that the interrupt handler needs, the system can deadlock. disable_irq_nosync differs from disable_irq in that it returns immediately. Thus, using disable_irq_nosync is a little faster but may leave your driver open to race conditions.

但为什么要禁用中断呢?继续讨论并行端口,让我们看看plip网络接口。plip设备 使用基本的并行端口来传输数据。由于只能从并行连接器读取五个位,因此它们被解释为四个数据位和一个时钟/握手信号。当发起方(发送数据包的接口)发送数据包的前四位时,时钟线升高,导致接收接口中断处理器。然后调用plip处理程序 来处理新到达的数据。

But why disable an interrupt? Sticking to the parallel port, let's look at the plip network interface. A plip device uses the bare-bones parallel port to transfer data. Since only five bits can be read from the parallel connector, they are interpreted as four data bits and a clock/handshake signal. When the first four bits of a packet are transmitted by the initiator (the interface sending the packet), the clock line is raised, causing the receiving interface to interrupt the processor. The plip handler is then invoked to deal with newly arrived data.

设备收到警报后,数据传输继续进行,使用握手线将新数据计时传送到接收接口(这可能不是最佳实现,但为了与使用并行端口的其他数据包驱动程序兼容,这样做是必要的)。如果接收接口必须为接收到的每个字节处理两个中断,性能将难以忍受。因此,驱动程序在接收数据包期间禁用该中断;取而代之的是,使用轮询加延迟的循环来读入数据。

After the device has been alerted, the data transfer proceeds, using the handshake line to clock new data to the receiving interface (this might not be the best implementation, but it is necessary for compatibility with other packet drivers using the parallel port). Performance would be unbearable if the receiving interface had to handle two interrupts for every byte received. Therefore, the driver disables the interrupt during the reception of the packet; instead, a poll-and-delay loop is used to bring in the data.

同样,由于从接收器到发送器的握手线用于确认数据接收,因此发送接口在数据包传输期间禁用其 IRQ 线。

Similarly, because the handshake line from the receiver to the transmitter is used to acknowledge data reception, the transmitting interface disables its IRQ line during packet transmission.

禁用所有中断

Disabling all interrupts

如果需要禁用所有中断怎么办?在 2.6 内核中,可以使用以下两个函数(在<asm/system.h>中定义)之一关闭当前处理器上的所有中断处理:

What if you need to disable all interrupts? In the 2.6 kernel, it is possible to turn off all interrupt handling on the current processor with either of the following two functions (which are defined in <asm/system.h>):

void local_irq_save(unsigned long flags);
void local_irq_disable(void);

调用 local_irq_save 会在把当前中断状态保存到 flags 之后,禁用当前处理器上的中断传递。注意,flags 是直接传递的,而不是通过指针传递。local_irq_disable 关闭本地中断传递而不保存状态;只有当您知道中断尚未在其他地方被禁用时,才应使用这个版本。

A call to local_irq_save disables interrupt delivery on the current processor after saving the current interrupt state into flags. Note that flags is passed directly, not by pointer. local_irq_disable shuts off local interrupt delivery without saving the state; you should use this version only if you know that interrupts have not already been disabled elsewhere.

重新打开中断是通过以下方式完成的:

Turning interrupts back on is accomplished with:

void local_irq_restore(unsigned long flags);
void local_irq_enable(void);

第一个版本恢复由 local_irq_save 存储到 flags 中的状态,而 local_irq_enable 则无条件地启用中断。与 disable_irq 不同,local_irq_disable 不会跟踪多次调用。如果调用链中有多个函数可能需要禁用中断,则应使用 local_irq_save。

The first version restores that state which was stored into flags by local_irq_save, while local_irq_enable enables interrupts unconditionally. Unlike disable_irq, local_irq_disable does not keep track of multiple calls. If more than one function in the call chain might need to disable interrupts, local_irq_save should be used.

在 2.6 内核中,没有办法在整个系统范围内全局禁用所有中断。内核开发人员认为关闭所有中断的成本太高,而且无论如何都不需要这种能力。如果您使用的旧驱动程序调用了诸如 cli 和 sti 之类的函数,则需要先将其更新为使用正确的锁定,它才能在 2.6 下工作。

In the 2.6 kernel, there is no way to disable all interrupts globally across the entire system. The kernel developers have decided that the cost of shutting off all interrupts is too high and that there is no need for that capability in any case. If you are working with an older driver that makes calls to functions such as cli and sti, you need to update it to use proper locking before it will work under 2.6.

上半部和下半部

Top and Bottom Halves

中断处理的主要问题之一是如何在处理程序中执行冗长的任务。通常,响应设备中断必须完成大量工作,但中断处理程序需要快速完成,并且不能长时间阻塞中断。这两种需求(工作量和速度)相互冲突,使驱动程序编写者陷入了两难境地。

One of the main problems with interrupt handling is how to perform lengthy tasks within a handler. Often a substantial amount of work must be done in response to a device interrupt, but interrupt handlers need to finish up quickly and not keep interrupts blocked for long. These two needs (work and speed) conflict with each other, leaving the driver writer in a bit of a bind.

Linux(以及许多其他系统)通过将中断处理程序分成两半来解决这个问题。所谓的上半部分是实际响应中断的例程——您使用request_irq注册的例程 。下半部是一个由上半部安排在稍后、更安全的时间执行的例程。上半部处理程序和下半部处理程序之间的最大区别在于,在执行下半部期间启用所有中断 - 这就是它在更安全的时间运行的原因。在典型场景中,上半部分将设备数据保存到设备特定的缓冲区,调度其下半部分,然后退出:此操作非常快。然后下半部分执行所需的任何其他工作,例如唤醒进程、启动另一个 I/O 操作等。此设置允许上半部分服务新的中断,而下半部分仍在工作。

Linux (along with many other systems) resolves this problem by splitting the interrupt handler into two halves. The so-called top half is the routine that actually responds to the interrupt—the one you register with request_irq. The bottom half is a routine that is scheduled by the top half to be executed later, at a safer time. The big difference between the top-half handler and the bottom half is that all interrupts are enabled during execution of the bottom half—that's why it runs at a safer time. In the typical scenario, the top half saves device data to a device-specific buffer, schedules its bottom half, and exits: this operation is very fast. The bottom half then performs whatever other work is required, such as awakening processes, starting up another I/O operation, and so on. This setup permits the top half to service a new interrupt while the bottom half is still working.

几乎每个重要的中断处理程序都是这样划分的。例如,当网络接口报告新数据包到达时,处理程序只是检索数据并将其推送到协议层;数据包的实际处理是在下半部分执行的。

Almost every serious interrupt handler is split this way. For instance, when a network interface reports the arrival of a new packet, the handler just retrieves the data and pushes it up to the protocol layer; actual processing of the packet is performed in a bottom half.

Linux 内核有两种不同的机制可用于实现下半部处理,这两种机制都在第 7 章中介绍过。tasklet 通常是下半部处理的首选机制;它们非常快,但所有 tasklet 代码都必须是原子的。tasklet 的替代方案是工作队列,后者可能具有较高的延迟,但允许休眠。

The Linux kernel has two different mechanisms that may be used to implement bottom-half processing, both of which were introduced in Chapter 7. Tasklets are often the preferred mechanism for bottom-half processing; they are very fast, but all tasklet code must be atomic. The alternative to tasklets is workqueues, which may have a higher latency but that are allowed to sleep.

下面的讨论再次以 short 驱动程序为例。通过模块选项加载时,可以告诉 short 使用 tasklet 或工作队列处理程序,以上半部/下半部模式进行中断处理。在这种情况下,上半部执行得很快;它只是记下当前时间并调度下半部处理。然后,下半部负责对这个时间进行编码,并唤醒任何可能正在等待数据的用户进程。

The following discussion works, once again, with the short driver. When loaded with a module option, short can be told to do interrupt processing in a top/bottom-half mode with either a tasklet or workqueue handler. In this case, the top half executes quickly; it simply remembers the current time and schedules the bottom half processing. The bottom half is then charged with encoding this time and awakening any user processes that may be waiting for data.

小任务

Tasklets

请记住,tasklet 是一种特殊的函数,可以被调度在软件中断上下文中、在系统确定的安全时间运行。它们可能被多次调度运行,但 tasklet 的调度不是累积的;即使在启动之前被多次请求,该 tasklet 也只运行一次。由于每次只运行一次,没有 tasklet 会与自身并行运行,但在 SMP 系统上,tasklet 可以与其他 tasklet 并行运行。因此,如果您的驱动程序有多个 tasklet,它们必须采用某种锁定来避免相互冲突。

Remember that tasklets are a special function that may be scheduled to run, in software interrupt context, at a system-determined safe time. They may be scheduled to run multiple times, but tasklet scheduling is not cumulative; the tasklet runs only once, even if it is requested repeatedly before it is launched. No tasklet ever runs in parallel with itself, since they run only once, but tasklets can run in parallel with other tasklets on SMP systems. Thus, if your driver has multiple tasklets, they must employ some sort of locking to avoid conflicting with each other.

Tasklet 还保证与首先调度它们的函数在同一 CPU 上运行。因此,中断处理程序可以确保在处理程序完成之前,tasklet 不会开始执行。然而,当微线程运行时当然可以传递另一个中断,因此可能仍然需要微线程和中断处理程序之间的锁定。

Tasklets are also guaranteed to run on the same CPU as the function that first schedules them. Therefore, an interrupt handler can be secure that a tasklet does not begin executing before the handler has completed. However, another interrupt can certainly be delivered while the tasklet is running, so locking between the tasklet and the interrupt handler may still be required.

Tasklet 必须用DECLARE_TASKLET宏声明:

Tasklets must be declared with the DECLARE_TASKLET macro:

DECLARE_TASKLET(name, function, data);

name是要赋予 tasklet 的名称, function是被调用以执行 tasklet 的函数(它采用一个unsigned long参数并返回void),并且data是要传递给tasklet函数的无符号长整型值。

name is the name to be given to the tasklet, function is the function that is called to execute the tasklet (it takes one unsigned long argument and returns void), and data is an unsigned long value to be passed to the tasklet function.

short 驱动程序声明其 tasklet 如下:

The short driver declares its tasklet as follows:

void short_do_tasklet(unsigned long);
DECLARE_TASKLET(short_tasklet, short_do_tasklet, 0);

函数 tasklet_schedule 用于调度 tasklet 运行。如果以 tasklet=1 加载 short,它会安装一个不同的中断处理程序,保存数据并按如下方式调度 tasklet:

The function tasklet_schedule is used to schedule a tasklet for running. If short is loaded with tasklet=1, it installs a different interrupt handler that saves data and schedules the tasklet as follows:

irqreturn_t short_tl_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    do_gettimeofday((struct timeval *) tv_head); /* cast to stop 'volatile' warning */
    short_incr_tv(&tv_head);
    tasklet_schedule(&short_tasklet);
    short_wq_count++; /* record that an interrupt arrived */
    return IRQ_HANDLED;
}

实际的tasklet例程short_do_tasklet将在系统方便时很快执行(可以这么说)。如前所述,该例程执行处理中断的大部分工作;它看起来像这样:

The actual tasklet routine, short_do_tasklet, will be executed shortly (so to speak) at the system's convenience. As mentioned earlier, this routine performs the bulk of the work of handling the interrupt; it looks like this:

void short_do_tasklet (unsigned long unused)
{
    int savecount = short_wq_count, written;
    short_wq_count = 0; /* we have already been removed from the queue */
    /*
     * The bottom half reads the tv array, filled by the top half,
     * and prints it to the circular text buffer, which is then consumed
     * by reading processes
     */

    /* First write the number of interrupts that occurred before this bh */
    written = sprintf((char *)short_head,"bh after %6i\n",savecount);
    short_incr_bp(&short_head, written);

    /*
     * Then, write the time values. Write exactly 16 bytes at a time,
     * so it aligns with PAGE_SIZE
     */

    do {
        written = sprintf((char *)short_head,"%08u.%06u\n",
                (int)(tv_tail->tv_sec % 100000000),
                (int)(tv_tail->tv_usec));
        short_incr_bp(&short_head, written);
        short_incr_tv(&tv_tail);
    } while (tv_tail != tv_head);

    wake_up_interruptible(&short_queue); /* awake any reading process */
}

除其他事项外,该 tasklet 还记录了自上次被调用以来到达的中断数量。像 short 这样的设备可以在短时间内产生大量中断,因此在下半部执行之前有多个中断到达的情况并不少见。驱动程序必须始终为这种可能性做好准备,并且必须能够根据上半部留下的信息确定有多少工作需要执行。

Among other things, this tasklet makes a note of how many interrupts have arrived since it was last called. A device such as short can generate a great many interrupts in a brief period, so it is not uncommon for several to arrive before the bottom half is executed. Drivers must always be prepared for this possibility and must be able to determine how much work there is to perform from the information left by the top half.

工作队列

Workqueues

回想一下,工作队列会在将来某个时间、在一个特殊工作进程的上下文中调用函数。由于工作队列函数在进程上下文中运行,因此需要时可以休眠。但是,除非使用我们在第 15 章中演示的高级技术,否则您不能从工作队列向用户空间复制数据;工作进程无权访问任何其他进程的地址空间。

Recall that workqueues invoke a function at some future time in the context of a special worker process. Since the workqueue function runs in process context, it can sleep if need be. You cannot, however, copy data into user space from a workqueue, unless you use the advanced techniques we demonstrate in Chapter 15; the worker process does not have access to any other process's address space.

如果加载时将 wq 选项设置为非零值,short 驱动程序将使用工作队列进行其下半部处理。它使用系统默认的工作队列,因此不需要特殊的设置代码;如果您的驱动程序有特殊的延迟要求(或者可能在工作队列函数中长时间休眠),您可能需要创建自己的专用工作队列。我们确实需要一个 work_struct 结构,它的声明和初始化如下:

The short driver, if loaded with the wq option set to a nonzero value, uses a workqueue for its bottom-half processing. It uses the system default workqueue, so there is no special setup code required; if your driver has special latency requirements (or might sleep for a long time in the workqueue function), you may want to create your own, dedicated workqueue. We do need a work_struct structure, which is declared and initialized with the following:

static struct work_struct short_wq;

    /* this line is in short_init(  ) */
    INIT_WORK(&short_wq, (void (*)(void *)) short_do_tasklet, NULL);

我们的工作函数是short_do_tasklet,我们已经在上一节中看到过。

Our worker function is short_do_tasklet, which we have already seen in the previous section.

当使用工作队列时,short会建立另一个中断处理程序,如下所示:

When working with a workqueue, short establishes yet another interrupt handler that looks like this:

irqreturn_t short_wq_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    /* Grab the current time information. */
    do_gettimeofday((struct timeval *) tv_head);
    short_incr_tv(&tv_head);

    /* Queue the bh. Don't worry about multiple enqueueing */
    schedule_work(&short_wq);

    short_wq_count++; /* record that an interrupt arrived */
    return IRQ_HANDLED;
}

如您所见,中断处理程序 看起来非常像tasklet 版本,不同之处在于它调用schedule_work来安排下半部处理。

As you can see, the interrupt handler looks very much like the tasklet version, with the exception that it calls schedule_work to arrange the bottom-half processing.

中断共享

Interrupt Sharing

IRQ 冲突的概念几乎是 PC 架构的代名词。过去,PC 上的 IRQ 线无法为多个设备提供服务,而且它们的数量从来都不够用。因此,沮丧的用户常常花费大量时间打开计算机机箱,试图找到一种方法让所有外围设备都能很好地协同工作。

The notion of an IRQ conflict is almost synonymous with the PC architecture. In the past, IRQ lines on the PC have not been able to serve more than one device, and there have never been enough of them. As a result, frustrated users have often spent much time with their computer case open, trying to find a way to make all of their peripherals play well together.

当然,现代硬件的设计允许共享中断;PCI 总线需要它。因此,Linux 内核支持所有总线上的中断共享,甚至包括那些传统上不支持共享的总线(例如 ISA 总线)。如果目标硬件可以支持该操作模式,则应编写 2.6 内核的设备驱动程序以使用共享中断。幸运的是,大多数时候,使用共享中断很容易。

Modern hardware, of course, has been designed to allow the sharing of interrupts; the PCI bus requires it. Therefore, the Linux kernel supports interrupt sharing on all buses, even those (such as the ISA bus) where sharing has traditionally not been supported. Device drivers for the 2.6 kernel should be written to work with shared interrupts if the target hardware can support that mode of operation. Fortunately, working with shared interrupts is easy, most of the time.

安装共享处理程序

Installing a Shared Handler

共享中断与非共享中断一样,都是通过 request_irq 安装的,但有两个区别:

Shared interrupts are installed through request_irq just like nonshared ones, but there are two differences:

  • 请求中断时,必须在 flags 参数中指定 SA_SHIRQ 位。

  • The SA_SHIRQ bit must be specified in the flags argument when requesting the interrupt.

  • dev_id 参数必须是唯一的。任何指向模块地址空间的指针都可以,但 dev_id 绝对不能设置为 NULL。

  • The dev_id argument must be unique. Any pointer into the module's address space will do, but dev_id definitely cannot be set to NULL.

内核保留与该中断关联的共享处理程序列表,dev_id 可以被视为区分它们的签名。如果两个驱动程序在同一个中断上都注册 NULL 作为它们的签名,卸载时事情可能会混乱,导致内核在中断到达时发生 oops。因此,如果在注册共享中断时传入 NULL 的 dev_id,现代内核会大声抱怨。当请求共享中断时,如果满足以下条件之一,request_irq 就会成功:

The kernel keeps a list of shared handlers associated with the interrupt, and dev_id can be thought of as the signature that differentiates between them. If two drivers were to register NULL as their signature on the same interrupt, things might get mixed up at unload time, causing the kernel to oops when an interrupt arrived. For this reason, modern kernels complain loudly if passed a NULL dev_id when registering shared interrupts. When a shared interrupt is requested, request_irq succeeds if one of the following is true:

  • 中断线是空闲的。

  • The interrupt line is free.

  • 已为该线路注册的所有处理程序也已指定要共享 IRQ。

  • All handlers already registered for that line have also specified that the IRQ is to be shared.

每当两个或多个驱动程序共享一条中断线并且硬件在该线上中断处理器时,内核就会调用为该中断注册的每个处理程序,并向每个处理程序传递其自己的 dev_id。因此,共享处理程序必须能够识别自己的中断,并且在自己的设备没有产生中断时应当快速退出。每当您的处理程序被调用而发现设备并未产生中断时,请务必返回 IRQ_NONE。

Whenever two or more drivers are sharing an interrupt line and the hardware interrupts the processor on that line, the kernel invokes every handler registered for that interrupt, passing each its own dev_id. Therefore, a shared handler must be able to recognize its own interrupts and should quickly exit when its own device has not interrupted. Be sure to return IRQ_NONE whenever your handler is called and finds that the device is not interrupting.

如果您需要在请求 IRQ 线之前探测您的设备,内核将无法帮助您。共享处理程序没有可用的探测功能。如果正在使用的线路是空闲的,则标准探测机制可以工作,但如果该线路已由具有共享功能的另一个驱动程序占用,则即使您的驱动程序可以完美工作,探测也会失败。幸运的是,大多数为中断共享而设计的硬件也能够告诉处理器它正在使用哪个中断,从而消除了显式探测的需要。

If you need to probe for your device before requesting the IRQ line, the kernel can't help you. No probing function is available for shared handlers. The standard probing mechanism works if the line being used is free, but if the line is already held by another driver with sharing capabilities, the probe fails, even if your driver would have worked perfectly. Fortunately, most hardware designed for interrupt sharing is also able to tell the processor which interrupt it is using, thus eliminating the need for explicit probing.

释放处理程序按正常方式执行,即使用 free_irq。这里的 dev_id 参数用于从该中断的共享处理程序列表中选择要释放的正确处理程序。这就是为什么 dev_id 指针必须是唯一的。

Releasing the handler is performed in the normal way, using free_irq. Here the dev_id argument is used to select the correct handler to release from the list of shared handlers for the interrupt. That's why the dev_id pointer must be unique.

使用共享处理程序的驱动程序还需要注意一件事:它不能随意使用 enable_irq 或 disable_irq。如果这样做,共享该线的其他设备可能会陷入混乱;即使只在很短的时间内禁用另一个设备的中断,也可能产生对该设备及其用户造成问题的延迟。一般来说,程序员必须记住,自己的驱动程序并不拥有该 IRQ,其行为应该比独占中断线时更加“顾全大局”。

A driver using a shared handler needs to be careful about one more thing: it can't play with enable_irq or disable_irq. If it does, things might go haywire for other devices sharing the line; disabling another device's interrupts for even a short time may create latencies that are problematic for that device and its user. Generally, the programmer must remember that his driver doesn't own the IRQ, and its behavior should be more "social" than would be necessary if it owned the interrupt line.

运行处理程序

Running the Handler

正如之前所建议的,当内核收到一个中断,所有注册的处理程序都会被调用。共享处理程序必须能够区分它需要处理的中断和其他设备生成的中断。

As suggested earlier, when the kernel receives an interrupt, all the registered handlers are invoked. A shared handler must be able to distinguish between interrupts that it needs to handle and interrupts generated by other devices.

使用选项shared=1加载short 会安装以下处理程序而不是默认处理程序:

Loading short with the option shared=1 installs the following handler instead of the default:

irqreturn_t short_sh_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    int value, written;
    struct timeval tv;

    /* If it wasn't short, return immediately */
    value = inb(short_base);
    if (!(value & 0x80))
        return IRQ_NONE;
    
    /* clear the interrupting bit */
    outb(value & 0x7F, short_base);

    /* the rest is unchanged */

    do_gettimeofday(&tv);
    written = sprintf((char *)short_head,"%08u.%06u\n",
            (int)(tv.tv_sec % 100000000), (int)(tv.tv_usec));
    short_incr_bp(&short_head, written);
    wake_up_interruptible(&short_queue); /* awake any reading process */
    return IRQ_HANDLED;
}

这里需要做一点解释。由于并行端口没有可供检查的“中断挂起”位,处理程序使用 ACK 位来实现这一目的。如果该位为高,则报告的中断是属于 short 的,处理程序随后清除该位。

An explanation is due here. Since the parallel port has no "interrupt-pending" bit to check, the handler uses the ACK bit for this purpose. If the bit is high, the interrupt being reported is for short, and the handler clears the bit.

处理程序通过将并行接口数据端口的高位清零来重置该位(short 假设引脚 9 和 10 连接在一起)。如果与 short 共享该 IRQ 的其他设备之一产生中断,short 会发现自己的线路仍处于非活动状态,于是什么也不做。

The handler resets the bit by zeroing the high bit of the parallel interface's data port—short assumes that pins 9 and 10 are connected together. If one of the other devices sharing the IRQ with short generates an interrupt, short sees that its own line is still inactive and does nothing.

当然,一个功能齐全的驱动程序可能会将工作分为上半部和下半部,但这很容易添加,并且不会对实现共享的代码产生任何影响。真正的驱动程序还可能使用 dev_id 参数来确定在可能的多个设备中,是哪一个产生了中断。

A full-featured driver probably splits the work into top and bottom halves, of course, but that's easy to add and does not have any impact on the code that implements sharing. A real driver would also likely use the dev_id argument to determine which, of possibly many, devices might be interrupting.

Note that if you are using a printer (instead of the jumper wire) to test interrupt management with short, this shared handler won't work as advertised, because the printer protocol doesn't allow for sharing, and the driver can't know whether the interrupt was from the printer.

The /proc Interface and Shared Interrupts

Installing shared handlers in the system doesn't affect /proc/stat, which doesn't even know about handlers. However, /proc/interrupts changes slightly.

All the handlers installed for the same interrupt number appear on the same line of /proc/interrupts. The following output (from an x86_64 system) shows how shared interrupt handlers are displayed:

           CPU0       
  0:  892335412         XT-PIC  timer
  1:     453971         XT-PIC  i8042
  2:          0         XT-PIC  cascade
  5:          0         XT-PIC  libata, ehci_hcd
  8:          0         XT-PIC  rtc
  9:          0         XT-PIC  acpi
 10:   11365067         XT-PIC  ide2, uhci_hcd, uhci_hcd, SysKonnect SK-98xx, EMU10K1
 11:    4391962         XT-PIC  uhci_hcd, uhci_hcd
 12:        224         XT-PIC  i8042
 14:    2787721         XT-PIC  ide0
 15:     203048         XT-PIC  ide1
NMI:      41234 
LOC:  892193503 
ERR:        102
MIS:          0

This system has several shared interrupt lines. IRQ 5 is used for the serial ATA and USB 2.0 controllers; IRQ 10 has several devices, including an IDE controller, two USB controllers, an Ethernet interface, and a sound card; and IRQ 11 also is used by two USB controllers.

Interrupt-Driven I/O

Whenever a data transfer to or from the managed hardware might be delayed for any reason, the driver writer should implement buffering. Data buffers help to detach data transmission and reception from the write and read system calls, and overall system performance benefits.

A good buffering mechanism leads to interrupt-driven I/O, in which an input buffer is filled at interrupt time and is emptied by processes that read the device; an output buffer is filled by processes that write to the device and is emptied at interrupt time. An example of interrupt-driven output is the implementation of /dev/shortprint.
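
The input and output buffers described here are, in essence, single-producer/single-consumer rings. As a minimal sketch of the idea (the names are ours, not taken from any of the sample drivers), a ring that a write method could fill at task time and an interrupt handler could drain might look like this:

```c
#define RING_SIZE 16          /* one "page" of buffer, kept tiny here */

static char ring[RING_SIZE];
static int ring_head;         /* next slot the producer fills  */
static int ring_tail;         /* next slot the consumer drains */

/* Bytes currently stored; one slot stays free so that
 * head == tail always means "empty", never "full". */
static int ring_count(void)
{
    return (ring_head - ring_tail + RING_SIZE) % RING_SIZE;
}

/* Producer side: what a write() method does at task time. */
static int ring_put(char c)
{
    if (ring_count() == RING_SIZE - 1)
        return -1;            /* full; a real driver would sleep here */
    ring[ring_head] = c;
    ring_head = (ring_head + 1) % RING_SIZE;
    return 0;
}

/* Consumer side: what the interrupt-time code does. */
static int ring_get(char *c)
{
    if (ring_count() == 0)
        return -1;            /* empty; output is no longer active */
    *c = ring[ring_tail];
    ring_tail = (ring_tail + 1) % RING_SIZE;
    return 0;
}
```

In a real driver the two sides run concurrently, so the indices must be protected with a lock, as the shortprint code shown later in this section does.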

For interrupt-driven data transfer to happen successfully, the hardware should be able to generate interrupts with the following semantics:

  • For input, the device interrupts the processor when new data has arrived and is ready to be retrieved by the system processor. The actual actions to perform depend on whether the device uses I/O ports, memory mapping, or DMA.

  • For output, the device delivers an interrupt either when it is ready to accept new data or to acknowledge a successful data transfer. Memory-mapped and DMA-capable devices usually generate interrupts to tell the system they are done with the buffer.

The timing relationships between a read or write and the actual arrival of data were introduced in Section 6.2.3 in Chapter 6.

A Write-Buffering Example

We have mentioned the shortprint driver a couple of times; now it is time to actually take a look. This module implements a very simple, output-oriented driver for the parallel port; it is sufficient, however, to enable the printing of files. If you choose to test this driver out, however, remember that you must pass the printer a file in a format it understands; not all printers respond well when given a stream of arbitrary data.

The shortprint driver maintains a one-page circular output buffer. When a user-space process writes data to the device, that data is fed into the buffer, but the write method does not actually perform any I/O. Instead, the core of shortp_write looks like this:

    while (written < count) {
        /* Hang out until some buffer space is available. */
        space = shortp_out_space(  );
        if (space <= 0) {
            if (wait_event_interruptible(shortp_out_queue,
                        (space = shortp_out_space(  )) > 0))
                goto out;
        }

        /* Move data into the buffer. */
        if ((space + written) > count)
            space = count - written;
        if (copy_from_user((char *) shortp_out_head, buf, space)) {
            up(&shortp_out_sem);
            return -EFAULT;
        }
        shortp_incr_out_bp(&shortp_out_head, space);
        buf += space;
        written += space;

        /* If no output is active, make it active. */
        spin_lock_irqsave(&shortp_out_lock, flags);
        if (! shortp_output_active)
            shortp_start_output(  );
        spin_unlock_irqrestore(&shortp_out_lock, flags);
    }

out:
    *f_pos += written;

A semaphore (shortp_out_sem) controls access to the circular buffer; shortp_write obtains that semaphore just prior to the code fragment above. While holding the semaphore, it attempts to feed data into the circular buffer. The function shortp_out_space returns the amount of contiguous space available (so there is no need to worry about buffer wraps); if that amount is 0, the driver waits until some space is freed. It then copies as much data as it can into the buffer.
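
shortp_out_space itself is not shown in the text. Its contract—return only the free space that is contiguous, so the copy never has to wrap—can be reconstructed along these lines (this is our sketch, not the actual module source; one byte is sacrificed so that head == tail always means empty):

```c
/* Contiguous free bytes starting at `head` in a circular buffer of
 * `size` bytes, never letting the head catch up to the tail. */
static int contig_space(int head, int tail, int size)
{
    if (head >= tail)
        /* Free region runs from head to the end of the buffer;
         * if tail sits at slot 0, stop one byte short of the end. */
        return size - head - (tail == 0 ? 1 : 0);
    /* Otherwise the free region is the gap between head and tail. */
    return tail - head - 1;
}
```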

Once there is data to output, shortp_write must ensure that the data is written to the device. The actual writing is done by way of a workqueue function; shortp_write must kick that function off if it is not already running. After obtaining a separate spinlock that controls access to variables used on the consumer side of the output buffer (including shortp_output_active), it calls shortp_start_output if need be. Then it's just a matter of noting how much data was "written" to the buffer and returning.
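
The "start the worker only if it is not already running" logic reduces to a guarded test-and-set. Stripped of the kernel machinery (in shortprint the test runs under shortp_out_lock with interrupts disabled; this single-threaded sketch elides the lock, and a counter stands in for queue_work):

```c
#include <stdbool.h>

static bool output_active;    /* is the output engine currently running? */
static int worker_starts;     /* stand-in for queue_work() submissions   */

/* Producer side: called after new data lands in the buffer. */
static void maybe_start_output(void)
{
    if (!output_active) {
        output_active = true; /* set inside shortp_start_output() */
        worker_starts++;      /* kernel: queue_work(wq, &work)    */
    }
}

/* Consumer side: called when the buffer drains completely. */
static void output_complete(void)
{
    output_active = false;
}
```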

The function that starts the output process looks like the following:

static void shortp_start_output(void)
{
    if (shortp_output_active) /* Should never happen */
        return;

    /* Set up our 'missed interrupt' timer */
    shortp_output_active = 1;
    shortp_timer.expires = jiffies + TIMEOUT;
    add_timer(&shortp_timer);

    /*  And get the process going. */
    queue_work(shortp_workqueue, &shortp_work);
}

The reality of dealing with hardware is that you can, occasionally, lose an interrupt from the device. When this happens, you really do not want your driver to stop forevermore until the system is rebooted; that is not a user-friendly way of doing things. It is far better to realize that an interrupt has been missed, pick up the pieces, and go on. To that end, shortprint sets a kernel timer whenever it outputs data to the device. If the timer expires, we may have missed an interrupt. We look at the timer function shortly, but, for the moment, let's stick with the main output functionality. That is implemented in our workqueue function, which, as you can see above, is scheduled here. The core of that function looks like the following:

    spin_lock_irqsave(&shortp_out_lock, flags);

    /* Have we written everything? */
    if (shortp_out_head == shortp_out_tail) { /* empty */
        shortp_output_active = 0;
        wake_up_interruptible(&shortp_empty_queue);
        del_timer(&shortp_timer);  
    }
    /* Nope, write another byte */
    else
        shortp_do_write(  );

    /* If somebody's waiting, maybe wake them up. */
    if (((PAGE_SIZE + shortp_out_tail - shortp_out_head) % PAGE_SIZE) > SP_MIN_SPACE) 
    {
        wake_up_interruptible(&shortp_out_queue);
    }
    spin_unlock_irqrestore(&shortp_out_lock, flags);

Since we are dealing with the output side's shared variables, we must obtain the spinlock. Then we look to see whether there is any more data to send out; if not, we note that output is no longer active, delete the timer, and wake up anybody who might have been waiting for the queue to become completely empty (this sort of wait is done when the device is closed). If, instead, there remains data to write, we call shortp_do_write to actually send a byte to the hardware.

Then, since we may have freed space in the output buffer, we consider waking up any processes waiting to add more data to that buffer. We do not perform that wakeup unconditionally, however; instead, we wait until a minimum amount of space is available. There is no point in awakening a writer every time we take one byte out of the buffer; the cost of awakening the process, scheduling it to run, and putting it back to sleep is too high for that. Instead, we should wait until that process is able to move a substantial amount of data into the buffer at once. This technique is common in buffering, interrupt-driven drivers.
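
The free-space test used above is worth checking in isolation; with small hypothetical numbers in place of PAGE_SIZE and SP_MIN_SPACE, the modular arithmetic behaves like this:

```c
/* Free space in a circular buffer of `size` bytes: the distance
 * from the head forward to the tail, modulo the buffer size.
 * (When head == tail the driver takes the "empty" branch first,
 * so the 0 returned for that case is never consulted.) */
static int free_space(int head, int tail, int size)
{
    return (size + tail - head) % size;
}

/* Wake sleeping writers only once a worthwhile amount of space
 * (min_space bytes) is available, not after every consumed byte. */
static int should_wake(int head, int tail, int size, int min_space)
{
    return free_space(head, tail, size) > min_space;
}
```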

For completeness, here is the code that actually writes the data to the port:

static void shortp_do_write(void)
{
    unsigned char cr = inb(shortp_base + SP_CONTROL);

    /* Something happened; reset the timer */
    mod_timer(&shortp_timer, jiffies + TIMEOUT);

    /* Strobe a byte out to the device */
    outb_p(*shortp_out_tail, shortp_base+SP_DATA);
    shortp_incr_out_bp(&shortp_out_tail, 1);
    if (shortp_delay)
        udelay(shortp_delay);
    outb_p(cr | SP_CR_STROBE, shortp_base+SP_CONTROL);
    if (shortp_delay)
        udelay(shortp_delay);
    outb_p(cr & ~SP_CR_STROBE, shortp_base+SP_CONTROL);
}

Here, we reset the timer to reflect the fact that we have made some progress, strobe the byte out to the device, and update the circular buffer pointer.

The workqueue function does not resubmit itself directly, so only a single byte will be written to the device. At some point, the printer will, in its slow way, consume the byte and become ready for the next one; it will then interrupt the processor. The interrupt handler used in shortprint is short and simple:

static irqreturn_t shortp_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    if (! shortp_output_active) 
        return IRQ_NONE;

    /* Remember the time, and farm off the rest to the workqueue function */ 
    do_gettimeofday(&shortp_tv);
    queue_work(shortp_workqueue, &shortp_work);
    return IRQ_HANDLED;
}

Since the parallel port does not require an explicit interrupt acknowledgment, all the interrupt handler really needs to do is to tell the kernel to run the workqueue function again.

What if the interrupt never comes? The driver code that we have seen thus far would simply come to a halt. To keep that from happening, we set a timer back a few pages ago. The function that is executed when that timer expires is:

static void shortp_timeout(unsigned long unused)
{
    unsigned long flags;
    unsigned char status;
   
    if (! shortp_output_active)
        return;
    spin_lock_irqsave(&shortp_out_lock, flags);
    status = inb(shortp_base + SP_STATUS);

    /* If the printer is still busy we just reset the timer */
    if ((status & SP_SR_BUSY) == 0 || (status & SP_SR_ACK)) {
        shortp_timer.expires = jiffies + TIMEOUT;
        add_timer(&shortp_timer);
        spin_unlock_irqrestore(&shortp_out_lock, flags);
        return;
    }

    /* Otherwise we must have dropped an interrupt. */
    spin_unlock_irqrestore(&shortp_out_lock, flags);
    shortp_interrupt(shortp_irq, NULL, NULL);
}

If no output is supposed to be active, the timer function simply returns; this keeps the timer from resubmitting itself when things are being shut down. Then, after taking the lock, we query the status of the port; if it claims to be busy, it simply hasn't gotten around to interrupting us yet, so we reset the timer and return. Printers can, at times, take a very long time to make themselves ready; consider the printer that runs out of paper while everybody is gone over a long weekend. In such situations, there is nothing to do other than to wait patiently until something changes.

If, however, the printer claims to be ready, we must have missed its interrupt. In that case, we simply invoke our interrupt handler manually to get the output process moving again.

The shortprint driver does not support reading from the port; instead, it behaves like shortint and returns interrupt timing information. The implementation of an interrupt-driven read method would be very similar to what we have seen, however. Data from the device would be read into a driver buffer; it would be copied out to user space only when a significant amount of data has accumulated in the buffer, the full read request has been satisfied, or some sort of timeout occurs.

Quick Reference

These symbols related to interrupt management were introduced in this chapter:

#include <linux/interrupt.h>

int request_irq(unsigned int irq, irqreturn_t (*handler)( ), unsigned long flags, const char *dev_name, void *dev_id);

void free_irq(unsigned int irq, void *dev_id);

Calls that register and unregister an interrupt handler.

#include <linux/irq.h>

int can_request_irq(unsigned int irq, unsigned long flags);

This function, available on the i386 and x86_64 architectures, returns a nonzero value if an attempt to allocate the given interrupt line succeeds.

#include <asm/signal.h>

SA_INTERRUPT

SA_SHIRQ

SA_SAMPLE_RANDOM

Flags for request_irq. SA_INTERRUPT requests installation of a fast handler (as opposed to a slow one). SA_SHIRQ installs a shared handler, and the third flag asserts that interrupt timestamps can be used to generate system entropy.

/proc/interrupts

/proc/stat

Filesystem nodes that report information about hardware interrupts and installed handlers.

unsigned long probe_irq_on(void);

int probe_irq_off(unsigned long);

Functions used by the driver when it has to probe to determine which interrupt line is being used by a device. The result of probe_irq_on must be passed back to probe_irq_off after the interrupt has been generated. The return value of probe_irq_off is the detected interrupt number.

IRQ_NONE

IRQ_HANDLED

IRQ_RETVAL(int x)

The possible return values from an interrupt handler, indicating whether an actual interrupt from the device was present.

void disable_irq(int irq);

void disable_irq_nosync(int irq);

void enable_irq(int irq);

A driver can enable and disable interrupt reporting. If the hardware tries to generate an interrupt while interrupts are disabled, the interrupt is lost forever. A driver using a shared handler must not use these functions.

void local_irq_save(unsigned long flags);

void local_irq_restore(unsigned long flags);

Use local_irq_save to disable interrupts on the local processor and remember their previous state. The flags can be passed to local_irq_restore to restore the previous interrupt state.

void local_irq_disable(void);

void local_irq_enable(void);

Functions that unconditionally disable and enable interrupts on the current processor.




[1] Although, some larger systems explicitly use interrupt balancing schemes to spread the interrupt load across the system.

[2] The shortint device accomplishes its task by alternately writing 0x00 and 0xff to the parallel port.

Chapter 11. Data Types in the Kernel

Before we go on to more advanced topics, we need to stop for a quick note on portability issues. Modern versions of the Linux kernel are highly portable, running on numerous different architectures. Given the multiplatform nature of Linux, drivers intended for serious use should be portable as well.

But a core issue with kernel code is being able both to access data items of known length (for example, filesystem data structures or registers on device boards) and to exploit the capabilities of different processors (32-bit and 64-bit architectures, and possibly 16 bit as well).

Several of the problems encountered by kernel developers while porting x86 code to new architectures have been related to incorrect data typing. Adherence to strict data typing and compiling with the -Wall -Wstrict-prototypes flags can prevent most bugs.

Data types used by kernel data are divided into three main classes: standard C types such as int, explicitly sized types such as u32, and types used for specific kernel objects, such as pid_t. We are going to see when and how each of the three typing classes should be used. The final sections of the chapter talk about some other typical problems you might run into when porting driver code from the x86 to other platforms, and introduce the generalized support for linked lists exported by recent kernel headers.

If you follow the guidelines we provide, your driver should compile and run even on platforms on which you are unable to test it.

Use of Standard C Types

Although most programmers are accustomed to freely using standard types like int and long, writing device drivers requires some care to avoid typing conflicts and obscure bugs.

The problem is that you can't use the standard types when you need "a 2-byte filler" or "something representing a 4-byte string," because the normal C data types are not the same size on all architectures. To show the data size of the various C types, the datasize program has been included in the sample files provided on O'Reilly's FTP site in the directory misc-progs. This is a sample run of the program on an i386 system (the last four types shown are introduced in the next section):

morgana% misc-progs/datasize
arch   Size:  char  short  int  long   ptr long-long  u8 u16 u32 u64
i686            1     2     4     4     4     8        1   2   4   8

The program can be used to show that long integers and pointers feature a different size on 64-bit platforms, as demonstrated by running the program on different Linux computers:

arch   Size:  char  short  int  long   ptr long-long  u8 u16 u32 u64
i386            1     2     4     4     4     8        1   2   4   8
alpha           1     2     4     8     8     8        1   2   4   8
armv4l          1     2     4     4     4     8        1   2   4   8
ia64            1     2     4     8     8     8        1   2   4   8
m68k            1     2     4     4     4     8        1   2   4   8
mips            1     2     4     4     4     8        1   2   4   8
ppc             1     2     4     4     4     8        1   2   4   8
sparc           1     2     4     4     4     8        1   2   4   8
sparc64         1     2     4     4     4     8        1   2   4   8
x86_64          1     2     4     8     8     8        1   2   4   8
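
The heart of such a program is nothing more than a row of sizeof expressions. A user-space sketch (not the actual datasize source, and without the kernel-only u8...u64 columns) could be:

```c
#include <stdio.h>

/* Print the size, in bytes, of each basic C type on this machine. */
static void print_sizes(void)
{
    printf("char %zu short %zu int %zu long %zu ptr %zu long-long %zu\n",
           sizeof(char), sizeof(short), sizeof(int), sizeof(long),
           sizeof(void *), sizeof(long long));
}
```

On an i386 box this prints the 1/2/4/4/4/8 row from the table above; on x86_64 the long and pointer columns become 8.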

It's interesting to note that the SPARC 64 architecture runs with a 32-bit user space, so pointers are 32 bits wide there, even though they are 64 bits wide in kernel space. This can be verified by loading the kdatasize module (available in the directory misc-modules within the sample files). The module reports size information at load time using printk and returns an error (so there's no need to unload it):

kernel: arch   Size:  char short int long  ptr long-long u8 u16 u32 u64
kernel: sparc64         1    2    4    8    8     8       1   2   4   8

Although you must be careful when mixing different data types, sometimes there are good reasons to do so. One such situation is for memory addresses, which are special as far as the kernel is concerned. Although, conceptually, addresses are pointers, memory administration is often better accomplished by using an unsigned integer type; the kernel treats physical memory like a huge array, and a memory address is just an index into the array. Furthermore, a pointer is easily dereferenced; when dealing directly with memory addresses, you almost never want to dereference them in this manner. Using an integer type prevents this dereferencing, thus avoiding bugs. Therefore, generic memory addresses in the kernel are usually unsigned long, exploiting the fact that pointers and long integers are always the same size, at least on all the platforms currently supported by Linux.

For what it's worth, the C99 standard defines the intptr_t and uintptr_t types for an integer variable that can hold a pointer value. These types are almost unused in the 2.6 kernel, however.
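
The round trip the kernel relies on—a memory address stored in an integer and back—looks like this with the C99 types just mentioned (uintptr_t standing in for the kernel's customary unsigned long):

```c
#include <stdint.h>

/* Treat an address as a plain integer index, the way the kernel
 * treats physical memory; an integer cannot be dereferenced by
 * accident, which is the point of the exercise. */
static uintptr_t addr_of(const void *p)
{
    return (uintptr_t)p;
}

static const void *ptr_of(uintptr_t a)
{
    return (const void *)a;
}
```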

Assigning an Explicit Size to Data Items

Sometimes kernel code requires data items of a specific size, perhaps to match predefined binary structures,[1] to communicate with user space, or to align data within structures by inserting "padding" fields (but refer to the Section 11.4.4 for information about alignment issues).

The kernel offers the following data types to use whenever you need to know the size of your data. All the types are declared in <asm/types.h>, which, in turn, is included by <linux/types.h>:

u8;   /* unsigned byte (8 bits) */
u16;  /* unsigned word (16 bits) */
u32;  /* unsigned 32-bit value */
u64;  /* unsigned 64-bit value */

The corresponding signed types exist, but are rarely needed; just replace u with s in the name if you need them.

If a user-space program needs to use these types, it can prefix the names with a double underscore: __u8 and the other types are defined independent of __KERNEL__. If, for example, a driver needs to exchange binary structures with a program running in user space by means of ioctl, the header files should declare 32-bit fields in the structures as __u32.
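
As an illustration of such a header (the structure and field names here are invented for the example), the layout can be pinned down with explicitly sized fields; in a real kernel header the fields would be the kernel's double-underscore types, while the C99 equivalents shown work on the user-space side:

```c
#include <stdint.h>

/* A binary structure exchanged with user space via ioctl, declared
 * with explicitly sized fields so its layout cannot drift between a
 * 32-bit application and a 64-bit kernel. */
struct sample_ioctl_arg {
    uint32_t flags;     /* always 4 bytes, on every architecture   */
    uint16_t channel;   /* always 2 bytes                          */
    uint16_t pad;       /* explicit padding keeps the layout fixed */
};

/* Fail at compile time if the layout is not what we expect
 * (C11 _Static_assert; older compilers need another trick). */
_Static_assert(sizeof(struct sample_ioctl_arg) == 8,
               "sample_ioctl_arg layout changed");
```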

It's important to remember that these types are Linux specific, and using them hinders porting software to other Unix flavors. Systems with recent compilers support the C99-standard types, such as uint8_t and uint32_t; if portability is a concern, those types can be used in favor of the Linux-specific variety.

You might also note that sometimes the kernel uses conventional types, such as unsigned int, for items whose dimension is architecture independent. This is usually done for backward compatibility. When u32 and friends were introduced in Version 1.1.67, the developers couldn't change existing data structures to the new types because the compiler issues a warning when there is a type mismatch between the structure field and the value being assigned to it.[2] Linus didn't expect the operating system (OS) he wrote for his own use to become multiplatform; as a result, old structures are sometimes loosely typed.

Interface-Specific Types

Some of the commonly used data types in the kernel have their own typedef statements, thus preventing any portability problems. For example, a process identifier (pid) is usually pid_t instead of int. Using pid_t masks any possible difference in the actual data typing. We use the expression interface-specific to refer to a type defined by a library in order to provide an interface to a specific data structure.

请注意,最近定义的新的特定于接口的类型相对较少。该语句的使用typedef已经不再受到许多内核开发人员的青睐,他们宁愿看到直接在代码中使用的真实类型信息,而不是隐藏在用户定义类型后面。然而,许多旧的特定于接口的类型仍然保留在内核中,并且它们不会很快消失。

Note that, in recent times, relatively few new interface-specific types have been defined. Use of the typedef statement has gone out of favor among many kernel developers, who would rather see the real type information used directly in the code, rather than hidden behind a user-defined type. Many older interface-specific types remain in the kernel, however, and they will not be going away anytime soon.

即使没有定义特定于接口的类型,以与内核其余部分一致的方式使用正确的数据类型也始终很重要。例如,jiffy 计数始终是 unsigned long,与其实际大小无关,因此在使用 jiffy 时应始终使用 unsigned long 类型。在本节中,我们集中讨论 _t 类型的使用。

Even when no interface-specific type is defined, it's always important to use the proper data type in a way consistent with the rest of the kernel. A jiffy count, for instance, is always unsigned long, independent of its actual size, so the unsigned long type should always be used when working with jiffies. In this section we concentrate on use of _t types.

许多 _t 类型定义在 <linux/types.h> 中,但该列表很少有用。当您需要特定类型时,您可以在需要调用的函数的原型或所使用的数据结构中找到它。

Many _t types are defined in <linux/types.h>, but the list is rarely useful. When you need a specific type, you'll find it in the prototype of the functions you need to call or in the data structures you use.

每当您的驱动程序使用需要此类“自定义”类型的函数并且您不遵循约定时,编译器都会发出警告;如果您使用-Wall编译器标志并小心删除所有警告,您可以确信您的代码是可移植的。

Whenever your driver uses functions that require such "custom" types and you don't follow the convention, the compiler issues a warning; if you use the -Wall compiler flag and are careful to remove all the warnings, you can feel confident that your code is portable.

_t 数据项的主要问题是,当您需要打印它们时,并不总是容易选择正确的 printk 或 printf 格式,而且您在一种体系结构上解决的警告会在另一种体系结构上重新出现。例如,您将如何打印一个 size_t?它在某些平台上是 unsigned long,而在其他平台上是 unsigned int。

The main problem with _t data items is that when you need to print them, it's not always easy to choose the right printk or printf format, and warnings you resolve on one architecture reappear on another. For example, how would you print a size_t, that is unsigned long on some platforms and unsigned int on some others?

每当您需要打印一些特定于接口的数据时,最好的方法是将值转换为最大可能的类型(通常是longunsigned long),然后通过相应的格式打印它。这种调整不会生成错误或警告,因为格式与类型匹配,并且您不会丢失数据位,因为转换要么是空操作,要么是项目到更大数据类型的扩展。

Whenever you need to print some interface-specific data, the best way to do it is by casting the value to the biggest possible type (usually long or unsigned long) and then printing it through the corresponding format. This kind of tweaking won't generate errors or warnings because the format matches the type, and you won't lose data bits because the cast is either a null operation or an extension of the item to a bigger data type.

实际上,我们正在讨论的数据项通常不打算打印,因此该问题仅适用于调试消息。大多数情况下,除了将它们作为参数传递给库或内核函数之外,代码只需要存储和比较特定于接口的类型。

In practice, the data items we're talking about aren't usually meant to be printed, so the issue applies only to debugging messages. Most often, the code needs only to store and compare the interface-specific types, in addition to passing them as arguments to library or kernel functions.

尽管_t类型是大多数情况下的正确解决方案,但有时正确的类型并不存在。对于一些尚未清理的旧接口,会发生这种情况。

Although _t types are the correct solution for most situations, sometimes the right type doesn't exist. This happens for some old interfaces that haven't yet been cleaned up.

我们在内核头文件中发现的一个模糊点是 I/O 函数的数据类型,它的定义是松散的(参见第 9 章中的 9.2.6 节)。松散类型主要是由于历史原因而存在,但在编写代码时可能会产生问题。例如,交换传给 outb 这类函数的参数顺序就可能带来麻烦;如果存在 port_t 类型,编译器就能发现这类错误。

The one ambiguous point we've found in the kernel headers is data typing for I/O functions, which is loosely defined (see the Section 9.2.6 in Chapter 9). The loose typing is mainly there for historical reasons, but it can create problems when writing code. For example, one can get into trouble by swapping the arguments to functions like outb; if there were a port_t type, the compiler would find this type of error.

其他可移植性问题

Other Portability Issues

除了数据输入之外,如果您希望驱动程序能够跨 Linux 平台移植,那么在编写驱动程序时还需要记住一些其他软件问题。

In addition to data typing, there are a few other software issues to keep in mind when writing a driver if you want it to be portable across Linux platforms.

一般规则是对显式常量值持怀疑态度。通常,代码已使用预处理器宏进行参数化。本节列出了最重要的可移植性问题。每当您遇到其他已参数化的值时,您都可以在头文件和随官方内核分发的设备驱动程序中找到提示。

A general rule is to be suspicious of explicit constant values. Usually the code has been parameterized using preprocessor macros. This section lists the most important portability problems. Whenever you encounter other values that have been parameterized, you can find hints in the header files and in the device drivers distributed with the official kernel.

时间间隔

Time Intervals

处理时间间隔时,不要假设每秒有 1000 个 jiffy。虽然目前 i386 架构确实如此,但并不是每个 Linux 平台都以这个速度运行。如果您调整了 HZ 值(就像某些人所做的那样),即使对于 x86,该假设也可能是错误的,并且没有人知道未来内核中会发生什么。每当您使用 jiffy 计算时间间隔时,请使用 HZ(每秒的定时器中断数)来缩放时间。例如,要检查半秒的超时,请将经过的时间与 HZ/2 进行比较。更一般地,与 msec 毫秒相对应的 jiffy 数始终为 msec*HZ/1000。

When dealing with time intervals, don't assume that there are 1000 jiffies per second. Although this is currently true for the i386 architecture, not every Linux platform runs at this speed. The assumption can be false even for the x86 if you play with the HZ value (as some people do), and nobody knows what will happen in future kernels. Whenever you calculate time intervals using jiffies, scale your times using HZ (the number of timer interrupts per second). For example, to check against a timeout of half a second, compare the elapsed time against HZ/2. More generally, the number of jiffies corresponding to msec milliseconds is always msec*HZ/1000.

页面大小

Page Size

在和内存打交道时,请记住内存页是 PAGE_SIZE 字节,而不是 4 KB。假设页面大小为 4 KB 并硬编码该值是 PC 程序员中的常见错误;实际上,受支持的平台的页面大小从 4 KB 到 64 KB 不等,有时同一平台的不同实现之间的页面大小也有所不同。相关的宏是 PAGE_SIZE 和 PAGE_SHIFT。后者包含为获得页号而需要对地址移位的位数。对于 4 KB 及更大的页面,该数字当前为 12 或更大。这些宏定义在 <asm/page.h> 中;如果用户空间程序需要该信息,可以使用 getpagesize 库函数。

When playing games with memory, remember that a memory page is PAGE_SIZE bytes, not 4 KB. Assuming that the page size is 4 KB and hardcoding the value is a common error among PC programmers; instead, supported platforms show page sizes from 4 KB to 64 KB, and sometimes they differ between different implementations of the same platform. The relevant macros are PAGE_SIZE and PAGE_SHIFT. The latter contains the number of bits to shift an address to get its page number. The number currently is 12 or greater for pages that are 4 KB and larger. The macros are defined in <asm/page.h>; user-space programs can use the getpagesize library function if they ever need the information.

让我们看一个不那么简单的情况。如果驱动程序需要 16 KB 的临时数据,它不应该向 get_free_pages 指定 order 为 2。您需要一个可移植的解决方案。幸运的是,这样的解决方案已经由内核开发人员编写好了,称为 get_order:

Let's look at a nontrivial situation. If a driver needs 16 KB for temporary data, it shouldn't specify an order of 2 to get_free_pages. You need a portable solution. Such a solution, fortunately, has been written by the kernel developers and is called get_order:

#include <asm/page.h>
int order = get_order(16*1024);
buf = get_free_pages(GFP_KERNEL, order);

请记住, get_order的参数必须是 2 的幂。

Remember that the argument to get_order must be a power of two.

字节顺序

Byte Order

请注意不要对字节顺序做出假设。PC 以低字节优先(小端优先,因此为小端)存储多字节值,而某些高级平台则以相反的方式工作(大端)。只要有可能,您的代码就应该编写成不关心它所操作的数据中的字节顺序。但是,有时驱动程序需要用单个字节构建整数或执行相反的操作,或者它必须与需要特定顺序的设备进行通信。

Be careful not to make assumptions about byte ordering. Whereas the PC stores multibyte values low-byte first (little end first, thus little-endian), some high-level platforms work the other way (big-endian). Whenever possible, your code should be written such that it does not care about byte ordering in the data it manipulates. However, sometimes a driver needs to build an integer number out of single bytes or do the opposite, or it must communicate with a device that expects a specific order.

包含文件 <asm/byteorder.h> 定义了 _ _BIG_ENDIAN 或 _ _LITTLE_ENDIAN 之一,具体取决于处理器的字节顺序。在处理字节顺序问题时,您可以编写一堆 #ifdef _ _LITTLE_ENDIAN 条件语句,但有更好的方法。Linux 内核定义了一组宏,用于处理处理器的字节顺序与需要以特定字节顺序存储或加载的数据之间的转换。例如:

The include file <asm/byteorder.h> defines either _ _BIG_ENDIAN or _ _LITTLE_ENDIAN, depending on the processor's byte ordering. When dealing with byte ordering issues, you could code a bunch of #ifdef _ _LITTLE_ENDIAN conditionals, but there is a better way. The Linux kernel defines a set of macros that handle conversions between the processor's byte ordering and that of the data you need to store or load in a specific byte order. For example:

u32 cpu_to_le32 (u32);
u32 le32_to_cpu (u32);
u32 cpu_to_le32 (u32);
u32 le32_to_cpu (u32);

这两个宏将 CPU 使用的值转换为无符号、小端、32 位数量,然后再转换回来。无论您的 CPU 是大端还是小端,也无论它是否是 32 位处理器,它们都可以工作。在没有工作要做的情况下,他们会原封不动地返回他们的论点。使用这些宏可以轻松编写可移植代码,而无需使用大量条件编译结构。

These two macros convert a value from whatever the CPU uses to an unsigned, little-endian, 32-bit quantity and back. They work whether your CPU is big-endian or little-endian and, for that matter, whether it is a 32-bit processor or not. They return their argument unchanged in cases where there is no work to be done. Use of these macros makes it easy to write portable code without having to use a lot of conditional compilation constructs.

类似的例程还有几十个;您可以在 <linux/byteorder/big_endian.h> 和 <linux/byteorder/little_endian.h> 中看到完整列表。稍加熟悉之后,这个命名模式就不难掌握了。be64_to_cpu 将无符号、大端、64 位值转换为内部 CPU 表示形式。而 le16_to_cpus 则处理有符号、小端、16 位数量。处理指针时,您还可以使用 cpu_to_le32p 等函数,它接受指向待转换值的指针而不是值本身。其余部分请参阅包含文件。

There are dozens of similar routines; you can see the full list in <linux/byteorder/big_endian.h> and <linux/byteorder/little_endian.h>. After a while, the pattern is not hard to follow. be64_to_cpu converts an unsigned, big-endian, 64-bit value to the internal CPU representation. le16_to_cpus, instead, handles signed, little-endian, 16-bit quantities. When dealing with pointers, you can also use functions like cpu_to_le32p, which take a pointer to the value to be converted rather than the value itself. See the include file for the rest.

数据对齐

Data Alignment

编写可移植代码时值得考虑的最后一个问题是如何访问未对齐的数据,例如,如何读取存储在不是 4 字节倍数的地址处的 4 字节值。i386 用户经常访问未对齐的数据项,但并非所有体系结构都允许这样做。许多现代架构在每次程序尝试未对齐的数据传输时都会生成异常;数据传输由异常处理程序处理,但性能会受到很大影响。如果需要访问未对齐的数据,应使用以下宏:

The last problem worth considering when writing portable code is how to access unaligned data—for example, how to read a 4-byte value stored at an address that isn't a multiple of 4 bytes. i386 users often access unaligned data items, but not all architectures permit it. Many modern architectures generate an exception every time the program tries unaligned data transfers; data transfer is handled by the exception handler, with a great performance penalty. If you need to access unaligned data, you should use the following macros:

#include <asm/unaligned.h>
get_unaligned(ptr);
put_unaligned(val, ptr);
#include <asm/unaligned.h>
get_unaligned(ptr);
put_unaligned(val, ptr);

这些宏是无类型的,适用于每个数据项,无论其长度是一、二、四还是八字节。它们是用任何内核版本定义的。

These macros are typeless and work for every data item, whether it's one, two, four, or eight bytes long. They are defined with any kernel version.

与对齐相关的另一个问题是数据结构跨平台的可移植性。相同的数据结构(如 C 语言源文件中定义的)可以在不同平台上进行不同的编译。编译器根据平台之间不同的约定来排列结构体字段。

Another issue related to alignment is portability of data structures across platforms. The same data structure (as defined in the C-language source file) can be compiled differently on different platforms. The compiler arranges structure fields to be aligned according to conventions that differ from platform to platform.

为了为可以跨架构移动的数据项编写数据结构,除了对特定字节顺序进行标准化之外,您还应该始终强制数据项的自然对齐。自然对齐意味着将数据项存储在其大小倍数的地址中(例如,8 字节项存储在 8 的倍数的地址中)。为了强制自然对齐,同时防止编译器以不可预测的方式排列字段,您应该使用填充字段,以避免在数据结构中留下空洞。

In order to write data structures for data items that can be moved across architectures, you should always enforce natural alignment of the data items in addition to standardizing on a specific endianness. Natural alignment means storing data items at an address that is a multiple of their size (for instance, 8-byte items go in an address multiple of 8). To enforce natural alignment while preventing the compiler from arranging the fields in unpredictable ways, you should use filler fields that avoid leaving holes in the data structure.

为了展示编译器如何强制对齐,dataalign程序分布在示例代码的 misc-progs目录中,并且等效的kdataalign模块是misc-modules的一部分。这是程序在多个平台上的输出以及 SPARC64 上模块的输出:

To show how alignment is enforced by the compiler, the dataalign program is distributed in the misc-progs directory of the sample code, and an equivalent kdataalign module is part of misc-modules. This is the output of the program on several platforms and the output of the module on the SPARC64:

arch  Align:  char  short  int  long   ptr long-long  u8 u16 u32 u64
i386            1     2     4     4     4     4        1   2   4   4
i686            1     2     4     4     4     4        1   2   4   4
alpha           1     2     4     8     8     8        1   2   4   8
armv4l          1     2     4     4     4     4        1   2   4   4
ia64            1     2     4     8     8     8        1   2   4   8
mips            1     2     4     4     4     8        1   2   4   8
ppc             1     2     4     4     4     8        1   2   4   8
sparc           1     2     4     4     4     8        1   2   4   8
sparc64         1     2     4     4     4     8        1   2   4   8
x86_64          1     2     4     8     8     8        1   2   4   8

kernel: arch  Align: char short int long  ptr long-long u8 u16 u32 u64
kernel: sparc64        1    2    4    8    8     8       1   2   4   8

值得注意的是,并非所有平台都在 64 位边界上对齐 64 位值,因此您需要填充字段来强制对齐并确保可移植性。

It's interesting to note that not all platforms align 64-bit values on 64-bit boundaries, so you need filler fields to enforce alignment and ensure portability.

最后,请注意,编译器可能会悄悄地将填充插入到结构本身中,以确保每个字段都对齐,以便在目标处理器上获得良好的性能。如果您定义的结构旨在与设备所需的结构相匹配,则此自动填充可能会阻碍您的尝试。解决这个问题的方法是告诉编译器该结构必须被“打包”,并且不添加任何填充物。例如,内核头文件<linux/edd.h>定义了与 x86 BIOS 接口的几个数据结构,它包括以下定义:

Finally, be aware that the compiler may quietly insert padding into structures itself to ensure that every field is aligned for good performance on the target processor. If you are defining a structure that is intended to match a structure expected by a device, this automatic padding may thwart your attempt. The way around this problem is to tell the compiler that the structure must be "packed," with no fillers added. For example, the kernel header file <linux/edd.h> defines several data structures used in interfacing with the x86 BIOS, and it includes the following definition:

struct {
        u16 id;
        u64 lun;
        u16 reserved1;
        u32 reserved2;
} _ _attribute_ _ ((packed)) scsi;

如果没有 _ _attribute_ _ ((packed)),lun 字段前面将有两个填充字节;如果我们在 64 位平台上编译该结构,则会有六个。

Without the _ _attribute_ _ ((packed)), the lun field would be preceded by two filler bytes or six if we compile the structure on a 64-bit platform.

指针和错误值

Pointers and Error Values

许多内部内核函数向调用者返回指针值。其中许多功能也可能会失败。在 大多数情况下,失败是通过返回NULL指针值来指示的。这种技术有效,但无法传达问题的确切性质。有些接口确实需要返回实际的错误代码,以便调用者可以根据实际发生的错误做出正确的决定。

Many internal kernel functions return a pointer value to the caller. Many of those functions can also fail. In most cases, failure is indicated by returning a NULL pointer value. This technique works, but it is unable to communicate the exact nature of the problem. Some interfaces really need to return an actual error code so that the caller can make the right decision based on what actually went wrong.

许多内核接口通过将错误代码编码在指针值中来返回此信息。必须谨慎使用此类函数,因为它们的返回值不能简单地与 NULL 进行比较。为了帮助创建和使用这类接口,<linux/err.h> 中提供了一小组函数。

A number of kernel interfaces return this information by encoding the error code in a pointer value. Such functions must be used with care, since their return value cannot simply be compared against NULL. To help in the creation and use of this sort of interface, a small set of functions has been made available (in <linux/err.h>).

返回指针类型的函数可以通过以下方式返回错误值:

A function returning a pointer type can return an error value with:

void *ERR_PTR(long error);

其中error是通常的负错误代码。调用者可以使用IS_ERR来测试返回的指针是否是错误代码:

where error is the usual negative error code. The caller can use IS_ERR to test whether a returned pointer is an error code or not:

long IS_ERR(const void *ptr);

如果您需要实际的错误代码,可以使用以下命令提取:

If you need the actual error code, it can be extracted with:

long PTR_ERR(const void *ptr);

您应该仅对 IS_ERR 返回真值的值使用 PTR_ERR;任何其他值都是有效的指针。

You should use PTR_ERR only on a value for which IS_ERR returns a true value; any other value is a valid pointer.

链表

Linked Lists

与许多其他程序一样,操作系统内核通常需要维护数据结构列表。Linux 内核有时会同时存在多个链表实现。为了减少重复代码的数量,内核开发人员创建了循环双向链表的标准实现;鼓励其他需要操作列表的人使用这一工具。

Operating system kernels, like many other programs, often need to maintain lists of data structures. The Linux kernel has, at times, been host to several linked list implementations at the same time. To reduce the amount of duplicated code, the kernel developers have created a standard implementation of circular, doubly linked lists; others needing to manipulate lists are encouraged to use this facility.

使用链表接口时,您应该始终记住列表函数不执行锁定。如果您的驱动程序有可能尝试在同一列表上执行并发操作,则您有责任实施锁定方案。替代方案(损坏的列表结构、数据丢失、内核恐慌)往往难以诊断。

When working with the linked list interface, you should always bear in mind that the list functions perform no locking. If there is a possibility that your driver could attempt to perform concurrent operations on the same list, it is your responsibility to implement a locking scheme. The alternatives (corrupted list structures, data loss, kernel panics) tend to be difficult to diagnose.

要使用列表机制,您的驱动程序必须包含文件<linux/list.h>。该文件定义了类型的简单结构list_head

To use the list mechanism, your driver must include the file <linux/list.h>. This file defines a simple structure of type list_head:

struct list_head {
    struct list_head *next, *prev;
};

实际代码中使用的链表几乎总是由某种类型的结构组成,每个结构描述列表中的一个条目。要在代码中使用 Linux 列表工具,您只需在构成列表的结构中嵌入一个 list_head 即可。例如,如果您的驱动程序维护一份待办事项列表,那么它的声明将如下所示:

Linked lists used in real code are almost invariably made up of some type of structure, each one describing one entry in the list. To use the Linux list facility in your code, you need only embed a list_head inside the structures that make up the list. If your driver maintains a list of things to do, say, its declaration would look something like this:

struct todo_struct {
    struct list_head list;
    int priority; /* driver specific */
    /* ... add other driver-specific fields */
};

列表的头部通常是一个独立的 list_head 结构。图 11-1 显示了如何使用简单的 struct list_head 来维护数据结构列表。

The head of the list is usually a standalone list_head structure. Figure 11-1 shows how the simple struct list_head is used to maintain a list of data structures.


图 11-1。list_head数据结构

Figure 11-1. The list_head data structure

列表头在使用之前必须用 INIT_LIST_HEAD 宏进行初始化。可以用以下方式声明并初始化一个"待办事项"列表头:

List heads must be initialized prior to use with the INIT_LIST_HEAD macro. A "things to do" list head could be declared and initialized with:

struct list_head todo_list;

INIT_LIST_HEAD(&todo_list);

或者,可以在编译时初始化列表:

Alternatively, lists can be initialized at compile time:

LIST_HEAD(todo_list);
LIST_HEAD(todo_list);

<linux/list.h>中定义了几个与列表一起使用的函数:

Several functions are defined in <linux/list.h> that work with lists:

list_add(struct list_head *new, struct list_head *head);
list_add(struct list_head *new, struct list_head *head);

在列表头之后立即添加 new 条目,通常也就是在列表的开头。因此,它可以用来构建堆栈。但请注意,head 不一定是列表的名义头部;如果您传递的 list_head 结构恰好位于列表中间的某个位置,则新条目将紧随其后。由于 Linux 列表是循环的,因此列表头通常与任何其他条目没有什么不同。

Adds the new entry immediately after the list head—normally at the beginning of the list. Therefore, it can be used to build stacks. Note, however, that the head need not be the nominal head of the list; if you pass a list_head structure that happens to be in the middle of the list somewhere, the new entry goes immediately after it. Since Linux lists are circular, the head of the list is not generally different from any other entry.

list_add_tail(struct list_head *new, struct list_head *head);
list_add_tail(struct list_head *new, struct list_head *head);

在给定列表头之前(换句话说,在列表末尾)添加一个新条目。 因此, list_add_tail可用于构建先进先出队列。

Adds a new entry just before the given list head—at the end of the list, in other words. list_add_tail can, thus, be used to build first-in first-out queues.

list_del(struct list_head *entry);

list_del_init(struct list_head *entry);
list_del(struct list_head *entry);

list_del_init(struct list_head *entry);

给定的条目将从列表中删除。如果该条目可能会被重新插入到另一个列表中,则应该使用list_del_init,它会重新初始化链表指针。

The given entry is removed from the list. If the entry might ever be reinserted into another list, you should use list_del_init, which reinitializes the linked list pointers.

list_move(struct list_head *entry, struct list_head *head);

list_move_tail(struct list_head *entry, struct list_head *head);
list_move(struct list_head *entry, struct list_head *head);

list_move_tail(struct list_head *entry, struct list_head *head);

给定的 entry 将从其当前列表中删除,并添加到 head 的开头。要将条目放到新列表的末尾,请改用 list_move_tail。

The given entry is removed from its current list and added to the beginning of head. To put the entry at the end of the new list, use list_move_tail instead.

list_empty(struct list_head *head);
list_empty(struct list_head *head);

如果给定列表为空,则返回非零值。

Returns a nonzero value if the given list is empty.

list_splice(struct list_head *list, struct list_head *head);
list_splice(struct list_head *list, struct list_head *head);

通过将 list 插入到 head 之后来连接两个列表。

Joins two lists by inserting list immediately after head.

这些 list_head 结构很适合实现由同类结构组成的列表,但调用程序通常对构成列表整体的那些较大结构更感兴趣。内核提供了一个宏 list_entry,它将 list_head 结构指针映射回指向包含它的结构的指针。它的调用方式如下:

The list_head structures are good for implementing a list of like structures, but the invoking program is usually more interested in the larger structures that make up the list as a whole. A macro, list_entry, is provided that maps a list_head structure pointer back into a pointer to the structure that contains it. It is invoked as follows:

list_entry(struct list_head *ptr, type_of_struct, field_name);
list_entry(struct list_head *ptr, type_of_struct, field_name);

其中 ptr 是指向所使用的 struct list_head 的指针,type_of_struct 是包含 ptr 的结构的类型,field_name 是结构中列表字段的名称。在我们之前的 todo_struct 结构中,列表字段简单地命名为 list。因此,我们可以用如下的一行代码将列表条目转换为其包含结构:

where ptr is a pointer to the struct list_head being used, type_of_struct is the type of the structure containing the ptr, and field_name is the name of the list field within the structure. In our todo_struct structure from before, the list field is called simply list. Thus, we would turn a list entry into its containing structure with a line such as:

struct todo_struct *todo_ptr =
    list_entry(listptr, struct todo_struct, list);

list_entry 宏需要一点时间来适应,但使用起来并不难。

The list_entry macro takes a little getting used to but is not that hard to use.

链表的遍历很简单:只需遵循prevnext指针即可。举个例子,假设我们想要保持todo_struct项目列表按优先级降序排序。添加新条目的函数如下所示:

The traversal of linked lists is easy: one need only follow the prev and next pointers. As an example, suppose we want to keep the list of todo_struct items sorted in descending priority order. A function to add a new entry would look something like this:

void todo_add_entry(struct todo_struct *new)
{
    struct list_head *ptr;
    struct todo_struct *entry;

    for (ptr = todo_list.next; ptr != &todo_list; ptr = ptr->next) {
        entry = list_entry(ptr, struct todo_struct, list);
        if (entry->priority < new->priority) {
            list_add_tail(&new->list, ptr);
            return;
        }
    }
    list_add_tail(&new->list, &todo_list);
}

但是,作为一般规则,最好使用一组预定义宏中的一个来创建迭代列表的循环。例如,前面的循环可以编码为:

However, as a general rule, it is better to use one of a set of predefined macros for creating loops that iterate through lists. The previous loop, for example, could be coded as:

void todo_add_entry(struct todo_struct *new)
{
    struct list_head *ptr;
    struct todo_struct *entry;

    list_for_each(ptr, &todo_list) {
        entry = list_entry(ptr, struct todo_struct, list);
        if (entry->priority < new->priority) {
            list_add_tail(&new->list, ptr);
            return;
        }
    }
    list_add_tail(&new->list, &todo_list);
}

使用提供的宏有助于避免简单的编程错误;这些宏的开发人员也付出了一些努力来确保它们表现良好。存在一些变体:

Using the provided macros helps avoid simple programming errors; the developers of these macros have also put some effort into ensuring that they perform well. A few variants exist:

list_for_each(struct list_head *cursor, struct list_head *list)
list_for_each(struct list_head *cursor, struct list_head *list)

该宏创建一个 for 循环,每执行一轮,cursor 就指向列表中的下一个条目。在迭代列表的同时修改列表时要小心。

This macro creates a for loop that executes once with cursor pointing at each successive entry in the list. Be careful about changing the list while iterating through it.

list_for_each_prev(struct list_head *cursor, struct list_head *list)
list_for_each_prev(struct list_head *cursor, struct list_head *list)

此版本向后迭代列表。

This version iterates backward through the list.

list_for_each_safe(struct list_head *cursor, struct list_head *next, struct

list_head *list)
list_for_each_safe(struct list_head *cursor, struct list_head *next, struct

list_head *list)

如果您的循环可能会删除列表中的条目,请使用此版本。它只是在每轮循环开始时将列表中的下一个条目保存在 next 中,因此即使 cursor 指向的条目被删除,它也不会迷失方向。

If your loop may delete entries in the list, use this version. It simply stores the next entry in the list in next at the beginning of the loop, so it does not get confused if the entry pointed to by cursor is deleted.

list_for_each_entry(type *cursor, struct list_head *list, member)

list_for_each_entry_safe(type *cursor, type *next, struct list_head *list,

member)
list_for_each_entry(type *cursor, struct list_head *list, member)

list_for_each_entry_safe(type *cursor, type *next, struct list_head *list,

member)

这些宏简化了处理包含给定结构类型的列表的过程。这里,cursor 是指向包含结构类型的指针,member 是包含结构中 list_head 结构成员的名称。使用这些宏,就无需在循环内放置 list_entry 调用。

These macros ease the process of dealing with a list containing a given type of structure. Here, cursor is a pointer to the containing structure type, and member is the name of the list_head structure within the containing structure. With these macros, there is no need to put list_entry calls inside the loop.

如果您查看 <linux/list.h> 内部,您会看到一些附加声明。hlist 类型是一种双向链表,它具有单独的、单指针的列表头类型;它通常用于创建哈希表和类似结构。还有一些用于迭代这两类列表的宏,旨在与读取-复制-更新(read-copy-update)机制一起使用(在第 5 章 5.7.5 节中描述)。这些原语不太可能在设备驱动程序中有用;如果您想进一步了解它们的工作方式,请参阅头文件。

If you look inside <linux/list.h>, you see some additional declarations. The hlist type is a doubly linked list with a separate, single-pointer list head type; it is often used for creation of hash tables and similar structures. There are also macros for iterating through both types of lists that are intended to work with the read-copy-update mechanism (described in Section 5.7.5 in Chapter 5). These primitives are unlikely to be useful in device drivers; see the header file if you would like more information on how they work.

快速参考

Quick Reference

本章介绍了以下符号:

The following symbols were introduced in this chapter:

#include <linux/types.h>

typedef u8;

typedef u16;

typedef u32;

typedef u64;
#include <linux/types.h>

typedef u8;

typedef u16;

typedef u32;

typedef u64;

类型保证为 8 位、16 位、32 位和 64 位无符号整数值。等效的有符号类型也存在。在用户空间中,您可以将类型称为 _ _u8_ _u16等。

Types guaranteed to be 8-, 16-, 32-, and 64-bit unsigned integer values. The equivalent signed types exist as well. In user space, you can refer to the types as _ _u8, _ _u16, and so forth.

#include <asm/page.h>

PAGE_SIZE

PAGE_SHIFT
#include <asm/page.h>

PAGE_SIZE

PAGE_SHIFT

这两个符号定义了当前体系结构每页的字节数以及页偏移中的位数(4 KB 页为 12,8 KB 页为 13)。

Symbols that define the number of bytes per page for the current architecture and the number of bits in the page offset (12 for 4-KB pages and 13 for 8-KB pages).

#include <asm/byteorder.h>

_ _LITTLE_ENDIAN

_ _BIG_ENDIAN
#include <asm/byteorder.h>

_ _LITTLE_ENDIAN

_ _BIG_ENDIAN

根据架构,仅定义两个符号之一。

Only one of the two symbols is defined, depending on the architecture.

#include <asm/byteorder.h>

u32 _ _cpu_to_le32 (u32);

u32 _ _le32_to_cpu (u32);
#include <asm/byteorder.h>

u32 _ _cpu_to_le32 (u32);

u32 _ _le32_to_cpu (u32);

在已知字节顺序与处理器字节顺序之间进行转换的函数。此类函数有 60 多个;有关完整列表及其定义方式,请参阅 include/linux/byteorder/ 中的各个文件。

Functions that convert between known byte orders and that of the processor. There are more than 60 such functions; see the various files in include/linux/byteorder/ for a full list and the ways in which they are defined.

#include <asm/unaligned.h>

get_unaligned(ptr);

put_unaligned(val, ptr);
#include <asm/unaligned.h>

get_unaligned(ptr);

put_unaligned(val, ptr);

某些体系结构需要使用这些宏来保护未对齐的数据访问。在允许访问未对齐数据的体系结构上,这些宏会展开为普通的指针解引用。

Some architectures need to protect unaligned data access using these macros. The macros expand to normal pointer dereferencing for architectures that permit you to access unaligned data.

#include <linux/err.h>

void *ERR_PTR(long error);

long PTR_ERR(const void *ptr);

long IS_ERR(const void *ptr);
#include <linux/err.h>

void *ERR_PTR(long error);

long PTR_ERR(const void *ptr);

long IS_ERR(const void *ptr);

这些函数允许返回指针值的函数同时返回错误代码。

Functions allow error codes to be returned by functions that return a pointer value.

#include <linux/list.h>

list_add(struct list_head *new, struct list_head *head);

list_add_tail(struct list_head *new, struct list_head *head);

list_del(struct list_head *entry);

list_del_init(struct list_head *entry);

list_empty(struct list_head *head);

list_entry(entry, type, member);

list_move(struct list_head *entry, struct list_head *head);

list_move_tail(struct list_head *entry, struct list_head *head);

list_splice(struct list_head *list, struct list_head *head);
#include <linux/list.h>

list_add(struct list_head *new, struct list_head *head);

list_add_tail(struct list_head *new, struct list_head *head);

list_del(struct list_head *entry);

list_del_init(struct list_head *entry);

list_empty(struct list_head *head);

list_entry(entry, type, member);

list_move(struct list_head *entry, struct list_head *head);

list_move_tail(struct list_head *entry, struct list_head *head);

list_splice(struct list_head *list, struct list_head *head);

操作循环双向链表的函数。

Functions that manipulate circular, doubly linked lists.

list_for_each(struct list_head *cursor, struct list_head *list)

list_for_each_prev(struct list_head *cursor, struct list_head *list)

list_for_each_safe(struct list_head *cursor, struct list_head *next, struct

list_head *list)

list_for_each_entry(type *cursor, struct list_head *list, member)

list_for_each_entry_safe(type *cursor, type *next, struct list_head *list,

member)
list_for_each(struct list_head *cursor, struct list_head *list)

list_for_each_prev(struct list_head *cursor, struct list_head *list)

list_for_each_safe(struct list_head *cursor, struct list_head *next, struct list_head *list)

list_for_each_entry(type *cursor, struct list_head *list, member)

list_for_each_entry_safe(type *cursor, type *next, struct list_head *list, member)

用于迭代链接列表的便捷宏。

Convenience macros for iterating through linked lists.




[ 1 ]当读取分区表、执行二进制文件或解码网络数据包时,会发生这种情况。

[1] This happens when reading partition tables, when executing a binary file, or when decoding a network packet.

[ 2 ]事实上,即使这两种类型只是同一对象的不同名称,例如 PC 上的 unsigned long 和 u32,编译器也会发出类型不一致的信号。

[2] As a matter of fact, the compiler signals type inconsistencies even if the two types are just different names for the same object, such as unsigned long and u32 on the PC.

第 12 章 PCI 驱动程序

Chapter 12. PCI Drivers

虽然第 9 章介绍了最低级别的硬件控制,但本章提供了较高级别总线体系结构的概述。总线由电气接口和编程接口组成。在本章中,我们将讨论编程接口。

While Chapter 9 introduced the lowest levels of hardware control, this chapter provides an overview of the higher-level bus architectures. A bus is made up of both an electrical interface and a programming interface. In this chapter, we deal with the programming interface.

本章涵盖了多种总线架构。然而,主要关注点是访问外围组件互连 (PCI) 外设的内核功能,因为如今 PCI 总线是台式机和大型计算机上最常用的外设总线,也是内核支持得最好的总线。ISA 在电子爱好者中仍然很常见,稍后会进行描述,尽管它几乎是一种裸机总线,除了第 9 章和第 10 章中介绍的内容之外,没有太多可说的。

This chapter covers a number of bus architectures. However, the primary focus is on the kernel functions that access Peripheral Component Interconnect (PCI) peripherals, because these days the PCI bus is the most commonly used peripheral bus on desktops and bigger computers. The bus is the one that is best supported by the kernel. ISA is still common for electronic hobbyists and is described later, although it is pretty much a bare-metal kind of bus, and there isn't much to say in addition to what is covered in Chapter 9 and Chapter 10.

PCI接口

The PCI Interface

虽然很多电脑 用户认为 PCI 是一种布置电线的方式,它实际上是一套完整的规范,定义了计算机的不同部分如何交互。

Although many computer users think of PCI as a way of laying out electrical wires, it is actually a complete set of specifications defining how different parts of a computer should interact.

PCI 规范涵盖了与计算机接口相关的大多数问题。我们不打算在这里涵盖所有内容;在本节中,我们主要关注 PCI 驱动程序如何找到其硬件并访问它。第 9 章和第 10 章中讨论的探测技术可以用于 PCI 设备,但规范提供了比探测更好的替代方法。

The PCI specification covers most issues related to computer interfaces. We are not going to cover it all here; in this section, we are mainly concerned with how a PCI driver can find its hardware and gain access to it. The probing techniques discussed in Chapter 9 and Chapter 10 can be used with PCI devices, but the specification offers an alternative that is preferable to probing.

PCI 架构被设计为 ISA 标准的替代品,具有三个主要目标:在计算机及其外设之间传输数据时获得更好的性能,尽可能独立于平台,以及简化向系统添加和删除外设的过程。

The PCI architecture was designed as a replacement for the ISA standard, with three main goals: to get better performance when transferring data between the computer and its peripherals, to be as platform independent as possible, and to simplify adding and removing peripherals to the system.

PCI总线通过使用比ISA更高的时钟速率来实现更好的性能;它的时钟运行频率为 25 或 33 MHz(其实际速率是系统时钟的一个因素),并且最近也部署了 66 MHz 甚至 133 MHz 的实现。而且,它配备了32位数据总线,并且64位扩展已包含在规范中。平台独立性通常是计算机总线设计的一个目标,也是 PCI 的一个特别重要的特性,因为 PC 世界一直由处理器特定的接口标准主导。PCI 目前广泛用于 IA-32、Alpha、PowerPC、SPARC64 和 IA-64 系统以及其他一些平台。

The PCI bus achieves better performance by using a higher clock rate than ISA; its clock runs at 25 or 33 MHz (its actual rate being a factor of the system clock), and 66-MHz and even 133-MHz implementations have recently been deployed as well. Moreover, it is equipped with a 32-bit data bus, and a 64-bit extension has been included in the specification. Platform independence is often a goal in the design of a computer bus, and it's an especially important feature of PCI, because the PC world has always been dominated by processor-specific interface standards. PCI is currently used extensively on IA-32, Alpha, PowerPC, SPARC64, and IA-64 systems, and some other platforms as well.

然而,与驱动程序编写者最相关的是 PCI 对接口板自动检测的支持。PCI 设备是无跳线的(与大多数较旧的外设不同),并且在启动时自动配置。然后,设备驱动程序必须能够访问设备中的配置信息才能完成初始化。这种情况无需执行任何探测即可发生。

What is most relevant to the driver writer, however, is PCI's support for autodetection of interface boards. PCI devices are jumperless (unlike most older peripherals) and are automatically configured at boot time. Then, the device driver must be able to access configuration information in the device in order to complete initialization. This happens without the need to perform any probing.

PCI 寻址

PCI Addressing

每个 PCI 外设由总线号、设备号和功能号来标识。PCI 规范允许单个系统最多承载 256 条总线,但由于 256 条总线对于许多大型系统来说是不够的,Linux 现在支持 PCI 域。每个 PCI 域最多可以承载 256 条总线。每条总线最多可承载 32 个设备,每个设备可以是最多具有 8 个功能的多功能板(例如带有 CD-ROM 驱动器的音频设备)。因此,每个功能都可以在硬件级别通过一个 16 位地址(或键)进行标识。不过,为 Linux 编写的设备驱动程序不需要处理这些二进制地址,因为它们使用称为 pci_dev 的特定数据结构来作用于设备。

Each PCI peripheral is identified by a bus number, a device number, and a function number. The PCI specification permits a single system to host up to 256 buses, but because 256 buses are not sufficient for many large systems, Linux now supports PCI domains. Each PCI domain can host up to 256 buses. Each bus hosts up to 32 devices, and each device can be a multifunction board (such as an audio device with an accompanying CD-ROM drive) with a maximum of eight functions. Therefore, each function can be identified at hardware level by a 16-bit address, or key. Device drivers written for Linux, though, don't need to deal with those binary addresses, because they use a specific data structure, called pci_dev, to act on the devices.

最近的大多数工作站至少具有两条 PCI 总线。在单个系统中插入多条总线是通过桥接器(bridge)实现的;桥接器是专用 PCI 外设,其任务是连接两条总线。PCI 系统的整体布局是一棵树,其中每条总线都连接到上一层总线,直到树根处的总线 0。CardBus PC 卡系统也通过桥接器连接到 PCI 系统。典型的 PCI 系统如图 12-1 所示,其中突出显示了各种桥接器。

Most recent workstations feature at least two PCI buses. Plugging more than one bus in a single system is accomplished by means of bridges , special-purpose PCI peripherals whose task is joining two buses. The overall layout of a PCI system is a tree where each bus is connected to an upper-layer bus, up to bus 0 at the root of the tree. The CardBus PC-card system is also connected to the PCI system via bridges. A typical PCI system is represented in Figure 12-1, where the various bridges are highlighted.

典型 PCI 系统布局

图 12-1。典型 PCI 系统布局

Figure 12-1. Layout of a typical PCI system

与 PCI 外设关联的 16 位硬件地址虽然大部分隐藏在 struct pci_dev 对象中,但偶尔仍然可见,特别是在使用设备列表时。其中一种情况是 lspci(pciutils 包的一部分,大多数发行版都提供)的输出,以及 /proc/pci 和 /proc/bus/pci 中的信息布局。PCI 设备的 sysfs 表示也显示了这种寻址方案,并添加了 PCI 域信息。[ 1 ]显示硬件地址时,可以显示为两个值(8 位总线号和 8 位设备及功能号)、三个值(总线、设备和功能)或四个值(域、总线、设备和功能);所有值通常以十六进制显示。

The 16-bit hardware addresses associated with PCI peripherals, although mostly hidden in the struct pci_dev object, are still visible occasionally, especially when lists of devices are being used. One such situation is the output of lspci (part of the pciutils package, available with most distributions) and the layout of information in /proc/pci and /proc/bus/pci. The sysfs representation of PCI devices also shows this addressing scheme, with the addition of the PCI domain information.[1] When the hardware address is displayed, it can be shown as two values (an 8-bit bus number and an 8-bit device and function number), as three values (bus, device, and function), or as four values (domain, bus, device, and function); all the values are usually displayed in hexadecimal.

例如,/proc/bus/pci/devices 使用单个 16 位字段(以方便解析和排序),而 /proc/bus/pci/busnumber 将地址拆分为三个字段。下面显示了这些地址的显示方式,仅显示输出行的开头:

For example, /proc/bus/pci/devices uses a single 16-bit field (to ease parsing and sorting), while /proc/bus/pci/busnumber splits the address into three fields. The following shows how those addresses appear, showing only the beginning of the output lines:

$ lspci | cut -d: -f1-3
0000:00:00.0 Host bridge
0000:00:00.1 RAM memory
0000:00:00.2 RAM memory
0000:00:02.0 USB Controller
0000:00:04.0 Multimedia audio controller
0000:00:06.0 Bridge
0000:00:07.0 ISA bridge
0000:00:09.0 USB Controller
0000:00:09.1 USB Controller
0000:00:09.2 USB Controller
0000:00:0c.0 CardBus bridge
0000:00:0f.0 IDE interface
0000:00:10.0 Ethernet controller
0000:00:12.0 Network controller
0000:00:13.0 FireWire (IEEE 1394)
0000:00:14.0 VGA compatible controller
$ cat /proc/bus/pci/devices | cut -f1
0000
0001
0002
0010
0020
0030
0038
0048
0049
004a
0060
0078
0080
0090
0098
00a0
$ tree /sys/bus/pci/devices/
/sys/bus/pci/devices/
|-- 0000:00:00.0 -> ../../../devices/pci0000:00/0000:00:00.0
|-- 0000:00:00.1 -> ../../../devices/pci0000:00/0000:00:00.1
|-- 0000:00:00.2 -> ../../../devices/pci0000:00/0000:00:00.2
|-- 0000:00:02.0 -> ../../../devices/pci0000:00/0000:00:02.0
|-- 0000:00:04.0 -> ../../../devices/pci0000:00/0000:00:04.0
|-- 0000:00:06.0 -> ../../../devices/pci0000:00/0000:00:06.0
|-- 0000:00:07.0 -> ../../../devices/pci0000:00/0000:00:07.0
|-- 0000:00:09.0 -> ../../../devices/pci0000:00/0000:00:09.0
|-- 0000:00:09.1 -> ../../../devices/pci0000:00/0000:00:09.1
|-- 0000:00:09.2 -> ../../../devices/pci0000:00/0000:00:09.2
|-- 0000:00:0c.0 -> ../../../devices/pci0000:00/0000:00:0c.0
|-- 0000:00:0f.0 -> ../../../devices/pci0000:00/0000:00:0f.0
|-- 0000:00:10.0 -> ../../../devices/pci0000:00/0000:00:10.0
|-- 0000:00:12.0 -> ../../../devices/pci0000:00/0000:00:12.0
|-- 0000:00:13.0 -> ../../../devices/pci0000:00/0000:00:13.0
`-- 0000:00:14.0 -> ../../../devices/pci0000:00/0000:00:14.0

所有三个设备列表都按相同的顺序排序,因为 lspci 使用 /proc 文件作为其信息源。以 VGA 视频控制器为例,0x00a0 按域(16 位)、总线(8 位)、设备(5 位)和功能(3 位)拆分后即为 0000:00:14.0。

All three lists of devices are sorted in the same order, since lspci uses the /proc files as its source of information. Taking the VGA video controller as an example, 0x00a0 means 0000:00:14.0 when split into domain (16 bits), bus (8 bits), device (5 bits) and function (3 bits).

每个外围板的硬件电路都会回答与三个地址空间相关的查询:内存位置、I/O 端口和配置寄存器。前两个地址空间由同一 PCI 总线上的所有设备共享(即,当您访问内存位置时,该 PCI 总线上的所有设备同时看到总线周期)。另一方面,配置空间利用地理寻址 。配置查询一次仅处理一个槽,因此它们永远不会发生冲突。

The hardware circuitry of each peripheral board answers queries pertaining to three address spaces: memory locations, I/O ports, and configuration registers. The first two address spaces are shared by all the devices on the same PCI bus (i.e., when you access a memory location, all the devices on that PCI bus see the bus cycle at the same time). The configuration space, on the other hand, exploits geographical addressing . Configuration queries address only one slot at a time, so they never collide.

对于驱动程序来说,内存和 I/O 区域通过 inb、readb 等以通常的方式访问。另一方面,配置事务是通过调用特定的内核函数来访问配置寄存器来执行的。关于中断,每个 PCI 插槽都有 4 个中断引脚,每个设备功能都可以使用其中一个中断引脚,而无需关心这些引脚如何路由到 CPU。这种路由是计算机平台的责任,并且是在 PCI 总线之外实现的。由于 PCI 规范要求中断线可共享,因此即使是 IRQ 线数量有限的处理器(例如 x86)也可以承载许多 PCI 接口板(每个板有四个中断引脚)。

As far as the driver is concerned, memory and I/O regions are accessed in the usual ways via inb, readb, and so forth. Configuration transactions, on the other hand, are performed by calling specific kernel functions to access configuration registers. With regard to interrupts, every PCI slot has four interrupt pins, and each device function can use one of them without being concerned about how those pins are routed to the CPU. Such routing is the responsibility of the computer platform and is implemented outside of the PCI bus. Since the PCI specification requires interrupt lines to be shareable, even a processor with a limited number of IRQ lines, such as the x86, can host many PCI interface boards (each with four interrupt pins).

PCI总线中的I/O空间使用32位地址总线(导致4GB I/O端口),而内存空间可以使用32位或64位地址进行访问。64 位地址可在较新的平台上使用。地址对于一台设备来说应该是唯一的,但软件可能会错误地将两台设备配置为同一地址,从而导致无法访问任一设备。但除非驱动程序愿意使用它不应该接触的寄存器,否则这个问题永远不会发生。好消息是接口板提供的每个存储器和 I/O 地址区域都可以通过配置事务重新映射。也就是说,固件在系统启动时初始化 PCI 硬件,将每个区域映射到不同的地址以避免冲突。[ 2 ]这些区域当前映射到的地址可以从配置空间中读取,因此Linux驱动程序可以访问其设备而无需探测。读取配置寄存器后,驱动程序可以安全地访问其硬件。

The I/O space in a PCI bus uses a 32-bit address bus (leading to 4 GB of I/O ports), while the memory space can be accessed with either 32-bit or 64-bit addresses. 64-bit addresses are available on more recent platforms. Addresses are supposed to be unique to one device, but software may erroneously configure two devices to the same address, making it impossible to access either one. But this problem never occurs unless a driver is willingly playing with registers it shouldn't touch. The good news is that every memory and I/O address region offered by the interface board can be remapped by means of configuration transactions. That is, the firmware initializes PCI hardware at system boot, mapping each region to a different address to avoid collisions.[2] The addresses to which these regions are currently mapped can be read from the configuration space, so the Linux driver can access its devices without probing. After reading the configuration registers, the driver can safely access its hardware.

每个设备功能的 PCI 配置空间由 256 个字节组成(PCI Express 设备除外,每个功能有 4 KB 的配置空间),并且配置寄存器的布局是标准化的。配置空间的四个字节保存唯一的功能 ID,因此驱动程序可以通过查找该外设的特定 ID 来识别其设备。[ 3 ]总之,每个设备板都通过地理寻址来检索其配置寄存器;然后,这些寄存器中的信息可用于执行正常的 I/O 访问,而无需进一步的地理寻址。

The PCI configuration space consists of 256 bytes for each device function (except for PCI Express devices, which have 4 KB of configuration space for each function), and the layout of the configuration registers is standardized. Four bytes of the configuration space hold a unique function ID, so the driver can identify its device by looking for the specific ID for that peripheral.[3] In summary, each device board is geographically addressed to retrieve its configuration registers; the information in those registers can then be used to perform normal I/O access, without the need for further geographic addressing.

从这个描述中应该可以清楚地看出,PCI 接口标准相对于 ISA 的主要创新是配置地址空间。因此,除了通常的驱动程序代码之外,PCI 驱动程序还需要能够访问配置空间,以便避免执行危险的探测任务。

It should be clear from this description that the main innovation of the PCI interface standard over ISA is the configuration address space. Therefore, in addition to the usual driver code, a PCI driver needs the ability to access the configuration space, in order to save itself from risky probing tasks.

在本章的其余部分中,我们使用“设备”一词指代设备功能,因为多功能板中的每个功能都充当独立的实体。当我们提到设备时,我们指的是元组“域号、总线号、设备号和功能号”。

For the remainder of this chapter, we use the word device to refer to a device function, because each function in a multifunction board acts as an independent entity. When we refer to a device, we mean the tuple "domain number, bus number, device number, and function number."

开机时间

Boot Time

为了了解 PCI 的工作原理,我们 从系统启动开始,因为这是配置设备的时候。

To see how PCI works, we start from system boot, since that's when the devices are configured.

当 PCI 设备通电时,硬件保持不活动状态。换句话说,设备仅响应配置事务。上电时,设备没有内存,也没有映射到计算机地址空间的 I/O 端口;所有其他特定于设备的功能(例如中断报告)也被禁用。

When power is applied to a PCI device, the hardware remains inactive. In other words, the device responds only to configuration transactions. At power on, the device has no memory and no I/O ports mapped in the computer's address space; every other device-specific feature, such as interrupt reporting, is disabled as well.

幸运的是,每个 PCI 主板都配备了 PCI 感知固件,称为 BIOS、NVRAM 或 PROM,具体取决于平台。固件通过读取和写入 PCI 控制器中的寄存器来提供对设备配置地址空间的访问。

Fortunately, every PCI motherboard is equipped with PCI-aware firmware, called the BIOS, NVRAM, or PROM, depending on the platform. The firmware offers access to the device configuration address space by reading and writing registers in the PCI controller.

在系统启动时,固件(或 Linux 内核,如果如此配置)与每个 PCI 外设执行配置事务,以便为其提供的每个地址区域分配一个安全的位置。当设备驱动程序访问设备时,其内存和 I/O 区域已经映射到处理器的地址空间。驱动程序可以更改此默认分配,但它永远不需要这样做。

At system boot, the firmware (or the Linux kernel, if so configured) performs configuration transactions with every PCI peripheral in order to allocate a safe place for each address region it offers. By the time a device driver accesses the device, its memory and I/O regions have already been mapped into the processor's address space. The driver can change this default assignment, but it never needs to do that.

根据建议,用户可以通过读取/proc/bus/pci/devices/proc/bus/pci/*/*来查看 PCI 设备列表和设备的配置寄存器。前者是带有(十六进制)设备信息的文本文件,后者是报告每个设备的配置寄存器快照的二进制文件,每个设备一个文件。sysfs 树中的各个 PCI 设备目录可以在/sys/bus/pci/devices中找到。PCI 设备目录包含许多不同的文件:

As suggested, the user can look at the PCI device list and the devices' configuration registers by reading /proc/bus/pci/devices and /proc/bus/pci/*/*. The former is a text file with (hexadecimal) device information, and the latter are binary files that report a snapshot of the configuration registers of each device, one file per device. The individual PCI device directories in the sysfs tree can be found in /sys/bus/pci/devices. A PCI device directory contains a number of different files:

$ tree /sys/bus/pci/devices/0000:00:10.0
/sys/bus/pci/devices/0000:00:10.0
|-- class
|-- config
|-- detach_state
|-- device
|-- irq
|-- power
|   `-- state
|-- resource
|-- subsystem_device
|-- subsystem_vendor
`-- vendor

文件 config 是一个二进制文件,允许从设备读取原始 PCI 配置信息(就像 /proc/bus/pci/*/* 提供的那样)。文件 vendor、device、subsystem_device、subsystem_vendor 和 class 均指该 PCI 设备的具体值(所有 PCI 设备都提供该信息)。文件 irq 显示当前分配给该 PCI 设备的 IRQ,文件 resource 显示该设备当前分配的内存资源。

The file config is a binary file that allows the raw PCI config information to be read from the device (just like the /proc/bus/pci/*/* provides.) The files vendor, device, subsystem_device, subsystem_vendor, and class all refer to the specific values of this PCI device (all PCI devices provide this information.) The file irq shows the current IRQ assigned to this PCI device, and the file resource shows the current memory resources allocated by this device.

配置寄存器和初始化

Configuration Registers and Initialization

在本节中,我们考察 PCI 设备包含的配置寄存器。所有 PCI 设备都至少具有 256 字节的地址空间。前 64 个字节是标准化的,其余的则取决于设备。图 12-2 显示了与设备无关的配置空间的布局。

In this section, we look at the configuration registers that PCI devices contain. All PCI devices feature at least a 256-byte address space. The first 64 bytes are standardized, while the rest are device dependent. Figure 12-2 shows the layout of the device-independent configuration space.

标准化 PCI 配置寄存器

图 12-2。标准化 PCI 配置寄存器

Figure 12-2. The standardized PCI configuration registers

如图所示,有些 PCI 配置寄存器是必需的,有些是可选的。每个 PCI 设备必须在所需寄存器中包含有意义的值,而可选寄存器的内容取决于外设的实际功能。除非必填字段的内容表明它们有效,否则不会使用可选字段。因此,必填字段断言板的功能,包括其他字段是否可用。

As the figure shows, some of the PCI configuration registers are required and some are optional. Every PCI device must contain meaningful values in the required registers, whereas the contents of the optional registers depend on the actual capabilities of the peripheral. The optional fields are not used unless the contents of the required fields indicate that they are valid. Thus, the required fields assert the board's capabilities, including whether the other fields are usable.

有趣的是,PCI 寄存器始终是小端字节序。尽管该标准被设计为独立于体系结构,但 PCI 设计者有时会表现出对 PC 环境的轻微偏见。驱动程序编写者在访问多字节配置寄存器时应注意字节顺序;在 PC 上运行的代码可能无法在其他平台上运行。Linux 开发人员已经解决了字节排序问题(请参阅下一节,第 12.1.8 节),但必须牢记这个问题。如果您需要将数据从主机顺序转换为 PCI 顺序,反之亦然,您可以使用 <asm/byteorder.h>中定义的函数(第 11 章中介绍),知道 PCI 字节顺序是小端字节序。

It's interesting to note that the PCI registers are always little-endian. Although the standard is designed to be architecture independent, the PCI designers sometimes show a slight bias toward the PC environment. The driver writer should be careful about byte ordering when accessing multibyte configuration registers; code that works on the PC might not work on other platforms. The Linux developers have taken care of the byte-ordering problem (see the next section, Section 12.1.8), but the issue must be kept in mind. If you ever need to convert data from host order to PCI order or vice versa, you can resort to the functions defined in <asm/byteorder.h>, introduced in Chapter 11, knowing that PCI byte order is little-endian.

描述所有配置项超出了本书的范围。通常,每个设备发布的技术文档都会描述支持的寄存器。我们感兴趣的是驱动程序如何查找其设备以及如何访问设备的配置空间。

Describing all the configuration items is beyond the scope of this book. Usually, the technical documentation released with each device describes the supported registers. What we're interested in is how a driver can look for its device and how it can access the device's configuration space.

三个或五个 PCI 寄存器标识一个设备:vendorID、deviceID 和 class 是始终使用的三个寄存器。每个 PCI 制造商都会为这些只读寄存器分配适当的值,驱动程序可以使用它们来查找设备。此外,subsystem vendorID 和 subsystem deviceID 字段有时由供应商设置,以进一步区分类似的设备。

Three or five PCI registers identify a device: vendorID, deviceID, and class are the three that are always used. Every PCI manufacturer assigns proper values to these read-only registers, and the driver can use them to look for the device. Additionally, the fields subsystem vendorID and subsystem deviceID are sometimes set by the vendor to further differentiate similar devices.

让我们更详细地看看这些寄存器:

Let's look at these registers in more detail:

vendorID
vendorID

该 16 位寄存器标识硬件制造商。例如,每个英特尔设备都标有相同的供应商编号0x8086。这些号码有一个全球注册记录,由 PCI 特别兴趣小组维护,制造商必须申请为其分配一个唯一的号码。

This 16-bit register identifies a hardware manufacturer. For instance, every Intel device is marked with the same vendor number, 0x8086. There is a global registry of such numbers, maintained by the PCI Special Interest Group, and manufacturers must apply to have a unique number assigned to them.

deviceID
deviceID

这是另一个16位寄存器,由制造商选择;设备ID无需正式注册。此 ID 通常与供应商 ID 配对,为硬件设备生成唯一的 32 位标识符。我们使用“签名”一词 来指代供应商和设备 ID 对。设备驱动程序通常依赖签名来识别其设备;您可以在目标设备的硬件手册中找到要查找的值。

This is another 16-bit register, selected by the manufacturer; no official registration is required for the device ID. This ID is usually paired with the vendor ID to make a unique 32-bit identifier for a hardware device. We use the word signature to refer to the vendor and device ID pair. A device driver usually relies on the signature to identify its device; you can find what value to look for in the hardware manual for the target device.

class
class

每个外围设备都属于一个类(class)。class 寄存器是一个 16 位值,其前 8 位标识"基类"(或组)。例如,"以太网"和"令牌环"是属于"网络"组的两个类,而"串行"和"并行"类则属于"通信"组。有些驱动程序可以支持多个类似的设备,每个设备都有不同的签名,但都属于同一类;这些驱动程序可以依靠 class 寄存器来识别其外设,如下所示。

Every peripheral device belongs to a class. The class register is a 16-bit value whose top 8 bits identify the "base class" (or group). For example, "ethernet" and "token ring" are two classes belonging to the "network" group, while the "serial" and "parallel" classes belong to the "communication" group. Some drivers can support several similar devices, each of them featuring a different signature but all belonging to the same class; these drivers can rely on the class register to identify their peripherals, as shown later.

subsystem vendorID

subsystem deviceID
subsystem vendorID

subsystem deviceID

这些字段可用于进一步识别设备。如果该芯片是本地(板载)总线的通用接口芯片,则它通常用于几个完全不同的角色,并且驱动程序必须识别它正在与之通信的实际设备。子系统标识符用于此目的。

These fields can be used for further identification of a device. If the chip is a generic interface chip to a local (onboard) bus, it is often used in several completely different roles, and the driver must identify the actual device it is talking with. The subsystem identifiers are used to this end.

使用这些不同的标识符,PCI 驱动程序可以告诉内核它支持什么类型的设备。struct pci_device_id 结构体用于定义驱动程序支持的不同类型 PCI 设备的列表。该结构包含以下字段:

Using these different identifiers, a PCI driver can tell the kernel what kind of devices it supports. The struct pci_device_id structure is used to define a list of the different types of PCI devices that a driver supports. This structure contains the following fields:

_ _u32 vendor;

_ _u32 device;
_ _u32 vendor;

_ _u32 device;

它们指定设备的 PCI 供应商和设备 ID。如果驱动程序可以处理任何供应商或设备 ID,则应将值 PCI_ANY_ID 用于这些字段。

These specify the PCI vendor and device IDs of a device. If a driver can handle any vendor or device ID, the value PCI_ANY_ID should be used for these fields.

_ _u32 subvendor;

_ _u32 subdevice;
_ _u32 subvendor;

_ _u32 subdevice;

它们指定设备的 PCI 子系统供应商和子系统设备 ID。如果驱动程序可以处理任何类型的子系统 ID,则应将值PCI_ANY_ID用于这些字段。

These specify the PCI subsystem vendor and subsystem device IDs of a device. If a driver can handle any type of subsystem ID, the value PCI_ANY_ID should be used for these fields.

_ _u32 class;

_ _u32 class_mask;
_ _u32 class;

_ _u32 class_mask;

这两个值允许驱动程序指定它支持某一类 PCI 设备。PCI 规范中描述了不同类别的 PCI 设备(VGA 控制器就是一个例子)。如果驱动程序不需要按类匹配,可以将这些字段保留为 0。

These two values allow the driver to specify that it supports a type of PCI class device. The different classes of PCI devices (a VGA controller is one example) are described in the PCI specification. A driver that does not match on class can leave these fields set to 0.

kernel_ulong_t driver_data;
kernel_ulong_t driver_data;

该值不用于匹配设备,而是用于保存 PCI 驱动程序可以根据需要用来区分不同设备的信息。

This value is not used to match a device but is used to hold information that the PCI driver can use to differentiate between different devices if it wants to.

应使用两个辅助宏来初始化struct pci_device_id结构:

There are two helper macros that should be used to initialize a struct pci_device_id structure:

PCI_DEVICE(vendor, device)
PCI_DEVICE(vendor, device)

这将创建仅匹配特定供应商和设备 ID 的 struct pci_device_id。该宏将结构体的 subvendor 和 subdevice 字段设置为 PCI_ANY_ID。

This creates a struct pci_device_id that matches only the specific vendor and device ID. The macro sets the subvendor and subdevice fields of the structure to PCI_ANY_ID.

PCI_DEVICE_CLASS(device_class, device_class_mask)
PCI_DEVICE_CLASS(device_class, device_class_mask)

这将创建与特定 PCI 类匹配的 struct pci_device_id。

This creates a struct pci_device_id that matches a specific PCI class.

使用这些宏定义驱动程序支持的设备类型的示例可以在以下内核文件中找到:

An example of using these macros to define the type of devices a driver supports can be found in the following kernel files:

drivers/usb/host/ehci-hcd.c:

static const struct pci_device_id pci_ids[  ] = { {
        /* handle any USB 2.0 EHCI controller */
        PCI_DEVICE_CLASS(((PCI_CLASS_SERIAL_USB << 8) | 0x20), ~0),
        .driver_data =  (unsigned long) &ehci_driver,
        },
        { /* end: all zeroes */ }
};

drivers/i2c/busses/i2c-i810.c:

static struct pci_device_id i810_ids[  ] = {
    { PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82810_IG1) },
    { PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82810_IG3) },
    { PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82810E_IG) },
    { PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82815_CGC) },
    { PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82845G_IG) },
    { 0, },
};

这些示例创建一个 struct pci_device_id 结构列表,并以一个全零的空结构作为列表中的最后一个值。该 ID 数组用于 struct pci_driver(如下所述),并且还用于告诉用户空间该特定驱动程序支持哪些设备。

These examples create a list of struct pci_device_id structures, with an empty structure set to all zeros as the last value in the list. This array of IDs is used in the struct pci_driver (described below), and it is also used to tell user space which devices this specific driver supports.

模块设备表

MODULE_DEVICE_TABLE

pci_device_id 结构需要导出到用户空间,以便热插拔和模块加载系统知道哪个模块与哪些硬件设备配合使用。宏 MODULE_DEVICE_TABLE 实现了这一点。一个例子是:

This pci_device_id structure needs to be exported to user space to let the hotplug and module loading systems know which module works with which hardware devices. The macro MODULE_DEVICE_TABLE accomplishes this. An example is:

MODULE_DEVICE_TABLE(pci, i810_ids);
MODULE_DEVICE_TABLE(pci, i810_ids);

该语句创建一个名为 _ _mod_pci_device_table 的局部变量,该变量指向 struct pci_device_id 列表。稍后在内核构建过程中,depmod 程序会在所有模块中搜索符号 _ _mod_pci_device_table。如果找到该符号,它会从模块中提取数据并将其添加到文件 /lib/modules/KERNEL_VERSION/modules.pcimap 中。depmod 完成后,该文件中会列出内核中模块支持的所有 PCI 设备及其模块名称。当内核告诉热插拔系统已找到新的 PCI 设备时,热插拔系统使用 modules.pcimap 文件来查找要加载的正确驱动程序。

This statement creates a local variable called _ _mod_pci_device_table that points to the list of struct pci_device_id. Later in the kernel build process, the depmod program searches all modules for the symbol _ _mod_pci_device_table. If that symbol is found, it pulls the data out of the module and adds it to the file /lib/modules/KERNEL_VERSION/modules.pcimap. After depmod completes, all PCI devices that are supported by modules in the kernel are listed, along with their module names, in that file. When the kernel tells the hotplug system that a new PCI device has been found, the hotplug system uses the modules.pcimap file to find the proper driver to load.

注册 PCI 驱动程序

Registering a PCI Driver

所有 PCI 驱动程序为了正确注册到内核而必须创建的主要结构是 struct pci_driver 结构。该结构由许多回调函数和变量组成,它们向 PCI 核心描述 PCI 驱动程序。以下是 PCI 驱动程序需要了解的该结构中的字段:

The main structure that all PCI drivers must create in order to be registered with the kernel properly is the struct pci_driver structure. This structure consists of a number of function callbacks and variables that describe the PCI driver to the PCI core. Here are the fields in this structure that a PCI driver needs to be aware of:

const char *name;
const char *name;

驱动程序的名称。它在内核中的所有 PCI 驱动程序中必须是唯一的,并且通常设置为与驱动程序的模块名相同的名称。当驱动程序位于内核中时,它会显示在 sysfs 的 /sys/bus/pci/drivers/ 下。

The name of the driver. It must be unique among all PCI drivers in the kernel and is normally set to the same name as the module name of the driver. It shows up in sysfs under /sys/bus/pci/drivers/ when the driver is in the kernel.

const struct pci_device_id *id_table;
const struct pci_device_id *id_table;

指向本章前面描述的 struct pci_device_id 表的指针。

Pointer to the struct pci_device_id table described earlier in this chapter.

int (*probe) (struct pci_dev *dev, const struct pci_device_id *id);
int (*probe) (struct pci_dev *dev, const struct pci_device_id *id);

指向 PCI 驱动程序中的探测函数的指针。当 PCI 核心有一个它认为该驱动程序想要控制的 struct pci_dev 时,就会调用此函数。PCI 核心用于做出此决定的 struct pci_device_id 的指针也会传递给此函数。如果 PCI 驱动程序声明了传递给它的 struct pci_dev,它应该正确初始化设备并返回 0。如果驱动程序不想声明该设备,或者发生错误,则应返回负错误值。有关此函数的更多详细信息将在本章后面介绍。

Pointer to the probe function in the PCI driver. This function is called by the PCI core when it has a struct pci_dev that it thinks this driver wants to control. A pointer to the struct pci_device_id that the PCI core used to make this decision is also passed to this function. If the PCI driver claims the struct pci_dev that is passed to it, it should initialize the device properly and return 0. If the driver does not want to claim the device, or an error occurs, it should return a negative error value. More details about this function follow later in this chapter.

void (*remove) (struct pci_dev *dev);

指向当 struct pci_dev 从系统中移除,或 PCI 驱动程序从内核卸载时,由 PCI 核心调用的函数的指针。有关此函数的更多细节将在本章后面介绍。

Pointer to the function that the PCI core calls when the struct pci_dev is being removed from the system, or when the PCI driver is being unloaded from the kernel. More details about this function follow later in this chapter.

int (*suspend) (struct pci_dev *dev, u32 state);

指向当 struct pci_dev 被挂起时由 PCI 核心调用的函数的指针。挂起状态通过 state 变量传入。该函数是可选的;驱动程序不必提供它。

Pointer to the function that the PCI core calls when the struct pci_dev is being suspended. The suspend state is passed in the state variable. This function is optional; a driver does not have to provide it.

int (*resume) (struct pci_dev *dev);

指向当 struct pci_dev 被恢复时由 PCI 核心调用的函数的指针。它总是在 suspend 被调用之后才会被调用。该函数是可选的;驱动程序不必提供它。

Pointer to the function that the PCI core calls when the struct pci_dev is being resumed. It is always called after suspend has been called. This function is optional; a driver does not have to provide it.

总之,要创建正确的struct pci_driver结构,只需要初始化四个字段:

In summary, to create a proper struct pci_driver structure, only four fields need to be initialized:

static struct pci_driver pci_driver = {
    .name = "pci_skel",
    .id_table = ids,
    .probe = probe,
    .remove = remove,
};

为了把 struct pci_driver 注册到 PCI 核心,需要以指向该 struct pci_driver 的指针为参数调用 pci_register_driver。传统上,这是在 PCI 驱动程序的模块初始化代码中完成的:

To register the struct pci_driver with the PCI core, a call to pci_register_driver is made with a pointer to the struct pci_driver. This is traditionally done in the module initialization code for the PCI driver:

static int __init pci_skel_init(void)
{
    return pci_register_driver(&pci_driver);
}

请注意,pci_register_driver 函数要么返回负的错误号,要么在一切注册成功时返回 0。它不会返回绑定到该驱动程序的设备数量;如果没有设备绑定到该驱动程序,也不会返回错误号。这是相对于 2.6 之前内核的一项改动,基于以下考虑:

Note that the pci_register_driver function either returns a negative error number or 0 if everything was registered successfully. It does not return the number of devices that were bound to the driver or an error number if no devices were bound to the driver. This is a change from kernels prior to the 2.6 release and was done because of the following situations:

  • 在支持 PCI 热插拔的系统或 CardBus 系统上,PCI 设备可以随时出现或消失。如果可以在设备出现之前加载驱动程序,这将有助于减少初始化设备所需的时间。

  • On systems that support PCI hotplug, or CardBus systems, a PCI device can appear or disappear at any point in time. It is helpful if drivers can be loaded before the device appears, to reduce the time it takes to initialize a device.

  • 2.6 内核允许在驱动程序加载之后,再把新的 PCI ID 动态分配给它。这是通过 sysfs 中每个 PCI 驱动程序目录下的 new_id 文件来完成的。如果正在使用内核尚不认识的新设备,这非常有用:用户把 PCI ID 值写入 new_id 文件,驱动程序随即绑定到新设备。如果不允许在系统中存在设备之前加载驱动程序,该接口将无法工作。

  • The 2.6 kernel allows new PCI IDs to be dynamically allocated to a driver after it has been loaded. This is done through the file new_id that is created in all PCI driver directories in sysfs. This is very useful if a new device is being used that the kernel doesn't know about just yet. A user can write the PCI ID values to the new_id file, and then the driver binds to the new device. If a driver was not allowed to load until a device was present in the system, this interface would not be able to work.

当要卸载 PCI 驱动程序时,需要把 struct pci_driver 从内核中注销。这通过调用 pci_unregister_driver 来完成。该调用发生时,当前绑定到此驱动程序的所有 PCI 设备都会被移除,并且在 pci_unregister_driver 函数返回之前,会调用此 PCI 驱动程序的 remove 函数。

When the PCI driver is to be unloaded, the struct pci_driver needs to be unregistered from the kernel. This is done with a call to pci_unregister_driver. When this call happens, any PCI devices that were currently bound to this driver are removed, and the remove function for this PCI driver is called before the pci_unregister_driver function returns.

static void __exit pci_skel_exit(void)
{
    pci_unregister_driver(&pci_driver);
}

旧式 PCI 探测

Old-Style PCI Probing

在较旧的内核版本中,PCI 驱动程序并不总是使用 pci_register_driver 函数。相反,它们要么手动遍历系统中的 PCI 设备列表,要么调用可以搜索特定 PCI 设备的函数。在驱动程序中遍历系统 PCI 设备列表的能力已从 2.6 内核中移除,以防止驱动程序在设备被移除的同时恰好修改 PCI 设备列表而导致内核崩溃。

In older kernel versions, the function, pci_register_driver, was not always used by PCI drivers. Instead, they would either walk the list of PCI devices in the system by hand, or they would call a function that could search for a specific PCI device. The ability to walk the list of PCI devices in the system within a driver has been removed from the 2.6 kernel in order to prevent drivers from crashing the kernel if they happened to modify the PCI device lists while a device was being removed at the same time.

如果确实需要查找特定 PCI 设备的能力,可以使用以下函数:

If the ability to find a specific PCI device is really needed, the following functions are available:

struct pci_dev *pci_get_device(unsigned int vendor, unsigned int device,
                               struct pci_dev *from);

该函数扫描系统中当前存在的 PCI 设备列表,如果某个设备与指定的 vendor 和 device ID 匹配,就增加所找到的 struct pci_dev 的引用计数,并把它返回给调用者。这可以防止该结构在没有任何通知的情况下消失,并确保内核不会发生 oops。驱动程序用完该函数返回的 struct pci_dev 之后,必须调用 pci_dev_put 来正确递减使用计数,以便设备被移除时内核能够清理它。

This function scans the list of PCI devices currently present in the system, and if the input arguments match the specified vendor and device IDs, it increments the reference count on the struct pci_dev variable found, and returns it to the caller. This prevents the structure from disappearing without any notice and ensures that the kernel does not oops. After the driver is done with the struct pci_dev returned by the function, it must call pci_dev_put to decrement the usage count properly, allowing the kernel to clean up the device if it is removed.

from 参数用于获取具有相同签名的多个设备;该参数应指向上一次找到的设备,以便继续搜索,而不是从列表头重新开始。要查找第一个设备,将 from 指定为 NULL。如果没有找到(更多)设备,则返回 NULL。

The from argument is used to get hold of multiple devices with the same signature; the argument should point to the last device that has been found, so that the search can continue instead of restarting from the head of the list. To find the first device, from is specified as NULL. If no (further) device is found, NULL is returned.

如何正确使用此功能的示例是:

An example of how to use this function properly is:

struct pci_dev *dev;
dev = pci_get_device(PCI_VENDOR_FOO, PCI_DEVICE_FOO, NULL);
if (dev) {
    /* Use the PCI device */
    ...
    pci_dev_put(dev);
}

该函数不能在中断上下文中调用;否则会在系统日志中打印一条警告。

This function cannot be called from interrupt context; if it is, a warning is printed to the system log.

struct pci_dev *pci_get_subsys(unsigned int vendor, unsigned int device,
                               unsigned int ss_vendor, unsigned int ss_device,
                               struct pci_dev *from);

此函数的工作方式与pci_get_device类似,但它允许在查找设备时指定子系统供应商和子系统设备 ID。

This function works just like pci_get_device, but it allows the subsystem vendor and subsystem device IDs to be specified when looking for the device.

该函数不能在中断上下文中调用;否则会在系统日志中打印一条警告。

This function cannot be called from interrupt context; if it is, a warning is printed to the system log.

struct pci_dev *pci_get_slot(struct pci_bus *bus, unsigned int devfn);

该函数在指定 struct pci_bus 上的 PCI 设备列表中,按给定的设备号和功能号搜索 PCI 设备。如果找到匹配的设备,则增加其引用计数并返回指向它的指针。调用者访问完该 struct pci_dev 后,必须调用 pci_dev_put。

This function searches the list of PCI devices in the system on the specified struct pci_bus for the specified device and function number of the PCI device. If a device is found that matches, its reference count is incremented and a pointer to it is returned. When the caller is finished accessing the struct pci_dev, it must call pci_dev_put.

所有这些函数都不能在中断上下文中调用;否则会在系统日志中打印一条警告。

None of these functions can be called from interrupt context; if they are, a warning is printed to the system log.

启用 PCI 设备

Enabling the PCI Device

在 PCI 驱动程序的探测函数中,驱动程序在访问 PCI 设备的任何设备资源(I/O 区域或中断)之前,必须先调用 pci_enable_device 函数:

In the probe function for the PCI driver, before the driver can access any device resource (I/O region or interrupt) of the PCI device, the driver must call the pci_enable_device function:

int pci_enable_device(struct pci_dev *dev);

该函数实际上启用了设备。它唤醒设备,在某些情况下还分配其中断线和 I/O 区域。例如,CardBus 设备(在驱动程序级别上与 PCI 完全等效)就会发生这种情况。

This function actually enables the device. It wakes up the device and in some cases also assigns its interrupt line and I/O regions. This happens, for example, with CardBus devices (which have been made completely equivalent to PCI at the driver level).

访问配置空间

Accessing the Configuration Space

驱动程序检测到设备后,通常需要读取或写入三个地址空间:内存、端口和配置。特别是,访问配置空间对驱动程序至关重要,因为这是它找出设备在内存和 I/O 空间中映射位置的唯一途径。

After the driver has detected the device, it usually needs to read from or write to the three address spaces: memory, port, and configuration. In particular, accessing the configuration space is vital to the driver, because it is the only way it can find out where the device is mapped in memory and in the I/O space.

由于微处理器无法直接访问配置空间,因此计算机供应商必须提供一种方法来实现这一点。要访问配置空间,CPU 必须写入和读取 PCI 控制器中的寄存器,但确切的实现取决于供应商,与本讨论无关,因为 Linux 提供了访问配置空间的标准接口。

Because the microprocessor has no way to access the configuration space directly, the computer vendor has to provide a way to do it. To access configuration space, the CPU must write and read registers in the PCI controller, but the exact implementation is vendor dependent and not relevant to this discussion, because Linux offers a standard interface to access the configuration space.

就驱动程序而言,可以通过 8 位、16 位或 32 位数据传输来访问配置空间。相关函数原型在 <linux/pci.h>中:

As far as the driver is concerned, the configuration space can be accessed through 8-bit, 16-bit, or 32-bit data transfers. The relevant functions are prototyped in <linux/pci.h>:

int pci_read_config_byte(struct pci_dev *dev, int where, u8 *val);
int pci_read_config_word(struct pci_dev *dev, int where, u16 *val);
int pci_read_config_dword(struct pci_dev *dev, int where, u32 *val);

从由 dev 标识的设备的配置空间中读取 1、2 或 4 个字节。参数 where 是距配置空间开头的字节偏移量。从配置空间取出的值通过 val 指针返回,函数的返回值是错误代码。word 和 dword 函数会把刚读取的值从小端字节序转换为处理器的本机字节序,因此您无需处理字节序问题。

Read one, two, or four bytes from the configuration space of the device identified by dev. The where argument is the byte offset from the beginning of the configuration space. The value fetched from the configuration space is returned through the val pointer, and the return value of the functions is an error code. The word and dword functions convert the value just read from little-endian to the native byte order of the processor, so you need not deal with byte ordering.

int pci_write_config_byte(struct pci_dev *dev, int where, u8 val);
int pci_write_config_word(struct pci_dev *dev, int where, u16 val);
int pci_write_config_dword(struct pci_dev *dev, int where, u32 val);

将 1、2 或 4 个字节写入配置空间。设备照常由 dev 标识,要写入的值作为 val 传入。word 和 dword 函数在写入外围设备之前会把值转换为小端字节序。

Write one, two, or four bytes to the configuration space. The device is identified by dev as usual, and the value being written is passed as val. The word and dword functions convert the value to little-endian before writing to the peripheral device.

前面的所有函数都实现为内联函数,它们实际上调用的是下面这些函数。如果驱动程序在某一时刻无法访问 struct pci_dev,可以直接使用这些函数:

All of the previous functions are implemented as inline functions that really call the following functions. Feel free to use these functions instead of the above in case the driver does not have access to a struct pci_dev at any particular moment in time:

int pci_bus_read_config_byte (struct pci_bus *bus, unsigned int devfn,
                              int where, u8 *val);
int pci_bus_read_config_word (struct pci_bus *bus, unsigned int devfn,
                              int where, u16 *val);
int pci_bus_read_config_dword (struct pci_bus *bus, unsigned int devfn,
                               int where, u32 *val);

就像 pci_read_ 系列函数一样,只是需要 struct pci_bus * 和 devfn 变量,而不是 struct pci_dev *。

Just like the pci_read_ functions, but struct pci_bus * and devfn variables are needed instead of a struct pci_dev *.

int pci_bus_write_config_byte (struct pci_bus *bus, unsigned int devfn,
                               int where, u8 val);
int pci_bus_write_config_word (struct pci_bus *bus, unsigned int devfn,
                               int where, u16 val);
int pci_bus_write_config_dword (struct pci_bus *bus, unsigned int devfn,
                                int where, u32 val);

就像 pci_write_ 系列函数一样,只是需要 struct pci_bus * 和 devfn 变量,而不是 struct pci_dev *。

Just like the pci_write_ functions, but struct pci_bus * and devfn variables are needed instead of a struct pci_dev *.

使用 pci_read_ 系列函数寻址配置变量的最佳方式,是通过 <linux/pci.h> 中定义的符号名称。例如,下面这个小函数通过把 where 的符号名称传给 pci_read_config_byte,来检索设备的修订 ID:

The best way to address the configuration variables using the pci_read_ functions is by means of the symbolic names defined in <linux/pci.h>. For example, the following small function retrieves the revision ID of a device by passing the symbolic name for where to pci_read_config_byte:

static unsigned char skel_get_revision(struct pci_dev *dev)
{
    u8 revision;

    pci_read_config_byte(dev, PCI_REVISION_ID, &revision);
    return revision;
}

访问 I/O 和内存空间

Accessing the I/O and Memory Spaces

PCI设备实现 最多六个 I/O 地址区域。每个区域由内存或 I/O 位置组成。大多数设备在内存区域中实现 I/O 寄存器,因为这通常是一种更明智的方法。然而,与普通内存不同,I/O 寄存器不应该由 CPU 缓存,因为每次访问都会产生副作用。将 I/O 寄存器实现为内存区域的 PCI 设备通过在其配置寄存器中设置“内存可预取”位来标记差异。[ 4 ]如果内存区域被标记为可预取,CPU 可以缓存其内容并用它进行各种优化;另一方面,不可预取的内存访问无法优化,因为每次访问都会产生副作用,就像 I/O 端口一样。将其控制寄存器映射到内存地址范围的外设将该范围声明为不可预取,而 PCI 板上的视频内存之类的外设是可预取的。在本节中,我们使用“区域”一词 指内存映射或端口映射的通用 I/O 地址空间。

A PCI device implements up to six I/O address regions. Each region consists of either memory or I/O locations. Most devices implement their I/O registers in memory regions, because it's generally a saner approach. However, unlike normal memory, I/O registers should not be cached by the CPU because each access can have side effects. The PCI device that implements I/O registers as a memory region marks the difference by setting a "memory-is-prefetchable" bit in its configuration register.[4] If the memory region is marked as prefetchable, the CPU can cache its contents and do all sorts of optimization with it; nonprefetchable memory access, on the other hand, can't be optimized because each access can have side effects, just as with I/O ports. Peripherals that map their control registers to a memory address range declare that range as nonprefetchable, whereas something like video memory on PCI boards is prefetchable. In this section, we use the word region to refer to a generic I/O address space that is memory-mapped or port-mapped.

接口板使用配置寄存器报告其各个区域的大小和当前位置,即图 12-2 所示的六个 32 位寄存器,其符号名称为 PCI_BASE_ADDRESS_0 到 PCI_BASE_ADDRESS_5。由于 PCI 定义的 I/O 空间是 32 位地址空间,对内存和 I/O 使用相同的配置接口是合理的。如果设备使用 64 位地址总线,它可以为每个区域使用两个连续的 PCI_BASE_ADDRESS 寄存器(低位在前),在 64 位内存空间中声明区域。一台设备可以同时提供 32 位区域和 64 位区域。

An interface board reports the size and current location of its regions using configuration registers—the six 32-bit registers shown in Figure 12-2, whose symbolic names are PCI_BASE_ADDRESS_0 through PCI_BASE_ADDRESS_5. Since the I/O space defined by PCI is a 32-bit address space, it makes sense to use the same configuration interface for memory and I/O. If the device uses a 64-bit address bus, it can declare regions in the 64-bit memory space by using two consecutive PCI_BASE_ADDRESS registers for each region, low bits first. It is possible for one device to offer both 32-bit regions and 64-bit regions.

在内核中,PCI设备的I/O区域已经被集成到通用资源管理中。因此,您无需访问配置变量即可了解设备在内存或 I/O 空间中的映射位置。获取区域信息的首选接口包含以下函数:

In the kernel, the I/O regions of PCI devices have been integrated into the generic resource management. For this reason, you don't need to access the configuration variables in order to know where your device is mapped in memory or I/O space. The preferred interface for getting region information consists of the following functions:

unsigned long pci_resource_start(struct pci_dev *dev, int bar);

该函数返回与六个 PCI I/O 区域之一关联的第一个地址(内存地址或 I/O 端口号)。该区域由整数bar(基地址寄存器)选择,范围为0-5(含)。

The function returns the first address (memory address or I/O port number) associated with one of the six PCI I/O regions. The region is selected by the integer bar (the base address register), ranging from 0-5 (inclusive).

unsigned long pci_resource_end(struct pci_dev *dev, int bar);

该函数返回属于第 bar 号 I/O 区域的最后一个地址。请注意,这是最后一个可用地址,而不是该区域之后的第一个地址。

The function returns the last address that is part of the I/O region number bar. Note that this is the last usable address, not the first address after the region.

unsigned long pci_resource_flags(struct pci_dev *dev, int bar);

该函数返回与该资源关联的标志。

This function returns the flags associated with this resource.

资源标志用于定义单个资源的一些功能。对于与 PCI I/O 区域关联的 PCI 资源,信息是从基地址寄存器中提取的,但对于与 PCI 设备不关联的资源,信息可以来自其他地方。

Resource flags are used to define some features of the individual resource. For PCI resources associated with PCI I/O regions, the information is extracted from the base address registers, but can come from elsewhere for resources not associated with PCI devices.

所有资源标志 在<linux/ioport.h>中定义;最重要的是:

All resource flags are defined in <linux/ioport.h>; the most important are:

IORESOURCE_IO

IORESOURCE_MEM

如果关联的 I/O 区域存在,则仅设置这些标志之一。

If the associated I/O region exists, one and only one of these flags is set.

IORESOURCE_PREFETCH

IORESOURCE_READONLY

这些标志表明内存区域是否可预取和/或写保护。后一个标志永远不会为 PCI 资源设置。

These flags tell whether a memory region is prefetchable and/or write protected. The latter flag is never set for PCI resources.

通过使用 pci_resource_ 系列函数,设备驱动程序可以完全忽略底层的 PCI 寄存器,因为系统已经用它们构造好了资源信息。

By making use of the pci_resource_ functions, a device driver can completely ignore the underlying PCI registers, since the system already used them to structure resource information.

PCI 中断

PCI Interrupts

就中断而言,PCI 是容易处理的。到 Linux 启动时,计算机固件已经为设备分配了唯一的中断号,驱动程序只需使用它即可。中断号存储在配置寄存器 60(PCI_INTERRUPT_LINE)中,该寄存器为一字节宽,因此最多允许 256 条中断线,但实际上限取决于所用的 CPU。驱动程序不必费心检查中断号,因为在 PCI_INTERRUPT_LINE 中找到的值保证是正确的。

As far as interrupts are concerned, PCI is easy to handle. By the time Linux boots, the computer's firmware has already assigned a unique interrupt number to the device, and the driver just needs to use it. The interrupt number is stored in configuration register 60 (PCI_INTERRUPT_LINE), which is one byte wide. This allows for as many as 256 interrupt lines, but the actual limit depends on the CPU being used. The driver doesn't need to bother checking the interrupt number, because the value found in PCI_INTERRUPT_LINE is guaranteed to be the right one.

如果设备不支持中断,则寄存器 61 ( PCI_INTERRUPT_PIN) 为0; 否则,它是非零的。但是,由于驱动程序知道其设备是否是中断驱动的,因此通常不需要读取PCI_INTERRUPT_PIN.

If the device doesn't support interrupts, register 61 (PCI_INTERRUPT_PIN) is 0; otherwise, it's nonzero. However, since the driver knows if its device is interrupt driven or not, it doesn't usually need to read PCI_INTERRUPT_PIN.

因此,处理中断的 PCI 专用代码只需要读取配置字节即可获取保存在局部变量中的中断号,如下代码所示。除此之外,第 10 章中的信息也适用。

Thus, PCI-specific code for dealing with interrupts just needs to read the configuration byte to obtain the interrupt number that is saved in a local variable, as shown in the following code. Beyond that, the information in Chapter 10 applies.

result = pci_read_config_byte(dev, PCI_INTERRUPT_LINE, &myirq);
if (result) {
    /* deal with error */
}

本节的其余部分为好奇的读者提供了附加信息,但编写驱动程序不需要这些信息。

The rest of this section provides additional information for the curious reader but isn't needed for writing drivers.

PCI 连接器有四个中断引脚,外围板可以使用其中任何一个或全部中断引脚。每个引脚都单独路由到主板的中断控制器,因此可以共享中断,而不会出现任何电气问题。然后,中断控制器负责将中断线(引脚)映射到处理器的硬件;这种与平台相关的操作留给控制器,以实现总线本身的平台独立性。

A PCI connector has four interrupt pins, and peripheral boards can use any or all of them. Each pin is individually routed to the motherboard's interrupt controller, so interrupts can be shared without any electrical problems. The interrupt controller is then responsible for mapping the interrupt wires (pins) to the processor's hardware; this platform-dependent operation is left to the controller in order to achieve platform independence in the bus itself.

位于 PCI_INTERRUPT_PIN 的只读配置寄存器用于告诉计算机实际使用的是哪个引脚。值得记住的是,每块设备板最多可以容纳八个设备;每个设备使用单独一个中断引脚,并在自己的配置寄存器中报告它。同一块设备板上的不同设备可以使用不同的中断引脚,也可以共享同一个。

The read-only configuration register located at PCI_INTERRUPT_PIN is used to tell the computer which single pin is actually used. It's worth remembering that each device board can host up to eight devices; each device uses a single interrupt pin and reports it in its own configuration register. Different devices on the same device board can use different interrupt pins or share the same one.

另一方面,PCI_INTERRUPT_LINE 寄存器是可读写的。当计算机启动时,固件扫描其 PCI 设备,并根据中断引脚在各 PCI 插槽上的路由方式设置每个设备的该寄存器。该值由固件分配,因为只有固件知道主板如何把不同的中断引脚路由到处理器。但对设备驱动程序而言,PCI_INTERRUPT_LINE 寄存器是只读的。有趣的是,最新版本的 Linux 内核在某些情况下可以不借助 BIOS 自行分配中断线。

The PCI_INTERRUPT_LINE register, on the other hand, is read/write. When the computer is booted, the firmware scans its PCI devices and sets the register for each device according to how the interrupt pin is routed for its PCI slot. The value is assigned by the firmware, because only the firmware knows how the motherboard routes the different interrupt pins to the processor. For the device driver, however, the PCI_INTERRUPT_LINE register is read-only. Interestingly, recent versions of the Linux kernel under some circumstances can assign interrupt lines without resorting to the BIOS.

硬件抽象

Hardware Abstractions

作为 PCI 讨论的收尾,我们快速看一下系统如何应付市场上种类繁多的 PCI 控制器。这只是一个信息性小节,旨在向好奇的读者展示内核的面向对象布局如何一直延伸到最底层。

We complete the discussion of PCI by taking a quick look at how the system handles the plethora of PCI controllers available on the marketplace. This is just an informational section, meant to show the curious reader how the object-oriented layout of the kernel extends down to the lowest levels.

用于实现硬件抽象的机制是包含方法的常见结构。这是一种强大的技术,只在函数调用的正常开销上增加了解引用一个指针的最小开销。就 PCI 管理而言,唯一依赖于硬件的操作是读写配置寄存器,因为 PCI 世界中的其他一切都是通过直接读写 I/O 和内存地址空间完成的,而这些都在 CPU 的直接控制之下。

The mechanism used to implement hardware abstraction is the usual structure containing methods. It's a powerful technique that adds just the minimal overhead of dereferencing a pointer to the normal overhead of a function call. In the case of PCI management, the only hardware-dependent operations are the ones that read and write configuration registers, because everything else in the PCI world is accomplished by directly reading and writing the I/O and memory address spaces, and those are under direct control of the CPU.

因此,配置寄存器访问的相关结构仅包括两个字段:

Thus, the relevant structure for configuration register access includes only two fields:

struct pci_ops {
    int (*read)(struct pci_bus *bus, unsigned int devfn, int where, int size, 
                u32 *val);
    int (*write)(struct pci_bus *bus, unsigned int devfn, int where, int size, 
                 u32 val);
};

该结构在<linux/pci.h>中定义 ,并由drivers/pci/pci.c使用,其中定义了实际的公共函数。

The structure is defined in <linux/pci.h> and used by drivers/pci/pci.c, where the actual public functions are defined.

作用于 PCI 配置空间的两个函数比取消引用指针有更多的开销;由于代码的高度面向对象性,它们使用级联指针,但对于很少执行且从不在速度关键的路径中执行的操作来说,开销不是问题。例如,pci_read_config_byte(dev, where, val)的实际实现 扩展到:

The two functions that act on the PCI configuration space have more overhead than dereferencing a pointer; they use cascading pointers due to the high object-orientedness of the code, but the overhead is not an issue in operations that are performed quite rarely and never in speed-critical paths. The actual implementation of pci_read_config_byte(dev, where, val), for instance, expands to:

dev->bus->ops->read(bus, devfn, where, 8, val);

系统中的各条 PCI 总线在系统启动时被检测到,也正是在那时,struct pci_bus 项被创建,并与其特性(包括 ops 字段)关联起来。

The various PCI buses in the system are detected at system boot, and that's when the struct pci_bus items are created and associated with their features, including the ops field.

通过“硬件操作”数据结构实现硬件抽象,在 Linux 内核中是很典型的做法。一个重要的例子是 struct alpha_machine_vector 数据结构。它定义在 <asm-alpha/machvec.h> 中,负责处理不同的基于 Alpha 的计算机之间可能变化的所有内容。

Implementing hardware abstraction via "hardware operations" data structures is typical in the Linux kernel. One important example is the struct alpha_machine_vector data structure. It is defined in <asm-alpha/machvec.h> and takes care of everything that may change across different Alpha-based computers.

回顾:ISA

A Look Back: ISA

ISA 总线设计相当古老,性能也是出了名的差,但它仍占据扩展设备市场的很大一部分。如果速度并不重要,而且您想支持旧主板,那么 ISA 实现比 PCI 更可取。这种旧标准还有一个优点:如果您是电子爱好者,您可以轻松构建自己的 ISA 设备,而这对 PCI 来说绝无可能。

The ISA bus is quite old in design and is a notoriously poor performer, but it still holds a good part of the market for extension devices. If speed is not important and you want to support old motherboards, an ISA implementation is preferable to PCI. An additional advantage of this old standard is that if you are an electronic hobbyist, you can easily build your own ISA devices, something definitely not possible with PCI.

另一方面,ISA 的一大缺点是它与 PC 体系结构紧密结合。接口总线具有80286处理器的所有限制,给系统程序员带来无尽的痛苦。ISA 设计(继承自原始 IBM PC)的另一个大问题是缺乏地理寻址,这导致了许多问题以及添加新设备所需的漫长的拔出-重新跳接-插入测试周期。有趣的是,即使是最古老的 Apple II 计算机也已经在利用地理寻址,并且它们具有无跳线扩展板。

On the other hand, a great disadvantage of ISA is that it's tightly bound to the PC architecture; the interface bus has all the limitations of the 80286 processor and causes endless pain to system programmers. The other great problem with the ISA design (inherited from the original IBM PC) is the lack of geographical addressing, which has led to many problems and lengthy unplug-rejumper-plug-test cycles to add new devices. It's interesting to note that even the oldest Apple II computers were already exploiting geographical addressing, and they featured jumperless expansion boards.

尽管有很大的缺点,ISA 仍被用在一些意想不到的地方。例如,一些掌上电脑中使用的 VR41xx 系列 MIPS 处理器就具有 ISA 兼容的扩展总线,尽管这看起来很奇怪。这些意想不到的 ISA 用途背后的原因,是某些传统硬件(例如基于 8390 的以太网卡)成本极低,因此具有 ISA 电气信号的 CPU 可以轻松利用这些糟糕但廉价的 PC 设备。

Despite its great disadvantages, ISA is still used in several unexpected places. For example, the VR41xx series of MIPS processors used in several palmtops features an ISA-compatible expansion bus, strange as it seems. The reason behind these unexpected uses of ISA is the extreme low cost of some legacy hardware, such as 8390-based Ethernet cards, so a CPU with ISA electrical signaling can easily exploit the awful, but cheap, PC devices.

硬件资源

Hardware Resources

ISA设备可以配备I/O 端口、内存区域和中断线。

An ISA device can be equipped with I/O ports, memory areas, and interrupt lines.

尽管 x86 处理器支持 64 KB I/O 端口内存(即处理器声明 16 个地址线),但某些旧 PC 硬件仅解码最低的 10 个地址线。这将可用地址空间限制为 1024 个端口,因为任何仅解码低地址线的设备都会将 1 KB 到 64 KB 范围内的任何地址误认为是低地址。一些外设通过仅将一个端口映射到低千字节并使用高地址线在不同的设备寄存器之间进行选择来规避这一限制。例如,映射的设备0x340可以安全地使用端口0x7400xB40等。

Even though the x86 processors support 64 KB of I/O port memory (i.e., the processor asserts 16 address lines), some old PC hardware decodes only the lowest 10 address lines. This limits the usable address space to 1024 ports, because any address in the range 1 KB to 64 KB is mistaken for a low address by any device that decodes only the low address lines. Some peripherals circumvent this limitation by mapping only one port into the low kilobyte and using the high address lines to select between different device registers. For example, a device mapped at 0x340 can safely use port 0x740, 0xB40, and so on.

如果说 I/O 端口的可用性有限,内存访问的情况就更糟了。ISA 设备只能使用 640 KB 到 1 MB 之间以及 15 MB 到 16 MB 之间的内存范围来放置 I/O 寄存器和进行设备控制。640 KB 到 1 MB 的范围被 PC BIOS、VGA 兼容显卡以及各种其他设备占用,留给新设备的空间所剩无几。而 15 MB 处的内存不被 Linux 直接支持,如今再修改内核去支持它纯属浪费编程时间。

If the availability of I/O ports is limited, memory access is still worse. An ISA device can use only the memory range between 640 KB and 1 MB and between 15 MB and 16 MB for I/O register and device control. The 640-KB to 1-MB range is used by the PC BIOS, by VGA-compatible video boards, and by various other devices, leaving little space available for new devices. Memory at 15 MB, on the other hand, is not directly supported by Linux, and hacking the kernel to support it is a waste of programming time nowadays.

ISA 设备板可用的第三个资源是中断线。有限数量的中断线被路由到ISA总线,并且它们由所有接口板共享。因此,如果设备配置不正确,它们可能会发现自己使用相同的中断线。

The third resource available to ISA device boards is interrupt lines. A limited number of interrupt lines is routed to the ISA bus, and they are shared by all the interface boards. As a result, if devices aren't properly configured, they can find themselves using the same interrupt lines.

尽管最初的 ISA 规范不允许跨设备共享中断,但大多数设备板允许这样做。[ 5 ]第10 章介绍了软件级别的中断共享。

Although the original ISA specification doesn't allow interrupt sharing across devices, most device boards allow it.[5] Interrupt sharing at the software level is described in Chapter 10.

ISA 编程

ISA Programming

就编程而言,内核或 BIOS 中没有专门的辅助手段来简化对 ISA 设备的访问(比如 PCI 就有)。您唯一可以使用的设施是 I/O 端口和 IRQ 线的注册表,如第 10.2 节所述。

As far as programming is concerned, there's no specific aid in the kernel or the BIOS to ease access to ISA devices (like there is, for example, for PCI). The only facilities you can use are the registries of I/O ports and IRQ lines, described in Section 10.2.

本书第一部分中展示的编程技术适用于 ISA 设备;驱动程序可以探测 I/O 端口,并且必须使用第 10.2.2 节中所示的技术之一自动检测中断线。

The programming techniques shown throughout the first part of this book apply to ISA devices; the driver can probe for I/O ports, and the interrupt line must be autodetected with one of the techniques shown in Section 10.2.2.

辅助函数isa_readb和它的朋友们已经在第 9 章中简单介绍过,这里不再多说。

The helper functions isa_readb and friends have been briefly introduced in Chapter 9, and there's nothing more to say about them.

即插即用规范

The Plug-and-Play Specification

一些新的 ISA 设备板遵循特殊的设计规则,并需要特殊的初始化序列,旨在简化附加接口板的安装和配置。这些板的设计规范称为 即插即用 (PnP),包含用于构建和配置无跳线 ISA 设备的繁琐规则集。PnP 设备实现可重定位 I/O 区域;PC 的 BIOS 负责重新定位——让人想起 PCI。

Some new ISA device boards follow peculiar design rules and require a special initialization sequence intended to simplify installation and configuration of add-on interface boards. The specification for the design of these boards is called plug and play (PnP) and consists of a cumbersome rule set for building and configuring jumperless ISA devices. PnP devices implement relocatable I/O regions; the PC's BIOS is responsible for the relocation—reminiscent of PCI.

简而言之,PnP 的目标是在不改变底层电气接口(ISA 总线)的情况下获得与 PCI 设备相同的灵活性。为此,规范定义了一组独立于设备的配置寄存器和一种对接口板进行地理寻址的方法,即使物理总线不承载每板(地理)布线:每条 ISA 信号线都连接到每个可用的插槽。

In short, the goal of PnP is to obtain the same flexibility found in PCI devices without changing the underlying electrical interface (the ISA bus). To this end, the specs define a set of device-independent configuration registers and a way to geographically address the interface boards, even though the physical bus doesn't carry per-board (geographical) wiring—every ISA signal line connects to every available slot.

地理寻址的工作方式是为计算机中的每个 PnP 外设分配一个小整数,称为卡选择号 (CSN)。每个 PnP 设备都具有唯一的 64 位宽串行标识符,该标识符硬连线到外围板中。CSN 分配使用该唯一序列号来标识 PnP 设备。但 CSN 只能在启动时安全分配,这要求 BIOS 具有 PnP 意识。因此,旧计算机要求用户获取并插入特定的配置软盘,即使设备支持 PnP。

Geographical addressing works by assigning a small integer, called the card select number (CSN), to each PnP peripheral in the computer. Each PnP device features a unique serial identifier, 64 bits wide, that is hardwired into the peripheral board. CSN assignment uses the unique serial number to identify the PnP devices. But the CSNs can be assigned safely only at boot time, which requires the BIOS to be PnP aware. For this reason, old computers require the user to obtain and insert a specific configuration diskette, even if the device is PnP capable.

遵循 PnP 规范的接口板在硬件层面上很复杂。它们比 PCI 板卡复杂得多,并且需要复杂的软件。安装这些设备时遇到困难并不罕见,即使安装顺利,您仍然面临性能限制和 ISA 总线有限的 I/O 空间。最好尽可能安装 PCI 设备并享受新技术。

Interface boards following the PnP specs are complicated at the hardware level. They are much more elaborate than PCI boards and require complex software. It's not unusual to have difficulty installing these devices, and even if the installation goes well, you still face the performance constraints and the limited I/O space of the ISA bus. It's much better to install PCI devices whenever possible and enjoy the new technology instead.

如果您对 PnP 配置软件感兴趣,可以浏览 drivers/net/3c509.c,其探测函数会处理 PnP 设备。2.6 内核在 PnP 设备支持方面做了大量工作,因此与以前的内核版本相比,许多不灵活的接口都已被清理掉。

If you are interested in the PnP configuration software, you can browse drivers/net/3c509.c, whose probing function deals with PnP devices. The 2.6 kernel saw a lot of work in the PnP device support area, so a lot of the inflexible interfaces have been cleaned up compared to previous kernel releases.

PC/104 和 PC/104+

PC/104 and PC/104+

目前在工业界,有两种总线架构非常流行:PC/104 和 PC/104+。两者都是 PC 级单板计算机的标准配置。

Currently in the industrial world, two bus architectures are quite fashionable: PC/104 and PC/104+. Both are standard in PC-class single-board computers.

这两个标准都涉及印刷电路板的特定外形尺寸,以及电路板互连的电气/机械规范。这些总线的实际优点是,它们允许使用设备一侧的插头和插座类型的连接器垂直堆叠电路板。

Both standards refer to specific form factors for printed circuit boards, as well as electrical/mechanical specifications for board interconnections. The practical advantage of these buses is that they allow circuit boards to be stacked vertically using a plug-and-socket kind of connector on one side of the device.

两条总线的电气和逻辑布局与 ISA (PC/104) 和 PCI (PC/104+) 相同,因此软件不会注意到普通桌面总线与这两条总线之间的任何差异。

The electrical and logical layout of the two buses is identical to ISA (PC/104) and PCI (PC/104+), so software won't notice any difference between the usual desktop buses and these two.

其他 PC 总线

Other PC Buses

PCI 和 ISA 是 PC 领域最常用的外设接口,但它们并不是唯一的。以下是 PC 市场上其他总线功能的总结。

PCI and ISA are the most commonly used peripheral interfaces in the PC world, but they aren't the only ones. Here's a summary of the features of other buses found in the PC market.

MCA

MCA

微通道架构(MCA)是PS/2 计算机和某些笔记本电脑中使用的 IBM 标准。在硬件层面,Micro Channel比ISA具有更多的功能。它支持多主 DMA、32 位地址和数据线、共享中断线以及用于访问每板配置寄存器的地理寻址。此类寄存器称为可编程选项选择 (POS),但它们不具备 PCI 寄存器的所有功能。Linux 对 Micro Channel 的支持包括导出到模块的功能。

Micro Channel Architecture (MCA) is an IBM standard used in PS/2 computers and some laptops. At the hardware level, Micro Channel has more features than ISA. It supports multimaster DMA, 32-bit address and data lines, shared interrupt lines, and geographical addressing to access per-board configuration registers. Such registers are called Programmable Option Select (POS), but they don't have all the features of the PCI registers. Linux support for Micro Channel includes functions that are exported to modules.

设备驱动程序可以读取整数值 MCA_bus 以查看它是否正在微通道计算机上运行。如果该符号是预处理器宏,则宏 MCA_bus__is_a_macro 也会被定义。如果 MCA_bus__is_a_macro 未定义,则 MCA_bus 是导出到模块化代码的整数变量。MCA_bus 和 MCA_bus__is_a_macro 都在 <asm/processor.h> 中定义。

A device driver can read the integer value MCA_bus to see if it is running on a Micro Channel computer. If the symbol is a preprocessor macro, the macro MCA_bus__is_a_macro is defined as well. If MCA_bus__is_a_macro is undefined, then MCA_bus is an integer variable exported to modularized code. Both MCA_bus and MCA_bus__is_a_macro are defined in <asm/processor.h>.

EISA

EISA

扩展 ISA (EISA) 总线是对 ISA 的 32 位扩展,具有兼容的接口连接器;ISA 设备板可以插入 EISA 连接器。额外的电线布线在 ISA 触点下方。

The Extended ISA (EISA) bus is a 32-bit extension to ISA, with a compatible interface connector; ISA device boards can be plugged into an EISA connector. The additional wires are routed under the ISA contacts.

与 PCI 和 MCA 一样,EISA 总线设计用于承载无跳线设备,并且具有与 MCA 相同的功能:32 位地址和数据线、多主 DMA 和共享中断线。EISA设备由软件配置,但它们不需要任何特定的操作系统支持。Linux 内核中已经存在用于以太网设备和 SCSI 控制器的 EISA 驱动程序。

Like PCI and MCA, the EISA bus is designed to host jumperless devices, and it has the same features as MCA: 32-bit address and data lines, multimaster DMA, and shared interrupt lines. EISA devices are configured by software, but they don't need any particular operating system support. EISA drivers already exist in the Linux kernel for Ethernet devices and SCSI controllers.

EISA 驱动程序检查 EISA_bus 的值以确定主计算机是否带有 EISA 总线。与 MCA_bus 一样,EISA_bus 要么是宏,要么是变量,具体取决于是否定义了 EISA_bus__is_a_macro。这两个符号都在 <asm/processor.h> 中定义。

An EISA driver checks the value EISA_bus to determine if the host computer carries an EISA bus. Like MCA_bus, EISA_bus is either a macro or a variable, depending on whether EISA_bus__is_a_macro is defined. Both symbols are defined in <asm/processor.h>.

内核对具有 sysfs 和资源管理功能的设备提供全面的 EISA 支持。它位于drivers/eisa目录中。

The kernel has full EISA support for devices with sysfs and resource management functionality. This is located in the drivers/eisa directory.

VLB

VLB

ISA 的另一个扩展是 VESA 本地总线 (VLB) 接口总线,它通过添加第三个纵向插槽扩展了 ISA 连接器。设备可以只插入这个额外的连接器(无需插入两个相关的 ISA 连接器),因为 VLB 插槽复制了来自 ISA 连接器的所有重要信号。这种不使用 ISA 插槽的"独立"VLB 外设很少见,因为大多数设备需要到达后面板,以便其外部连接器可用。

Another extension to ISA is the VESA Local Bus (VLB) interface bus, which extends the ISA connectors by adding a third lengthwise slot. A device can just plug into this extra connector (without plugging in the two associated ISA connectors), because the VLB slot duplicates all important signals from the ISA connectors. Such "standalone" VLB peripherals not using the ISA slot are rare, because most devices need to reach the back panel so that their external connectors are available.

VESA 总线的功能比 EISA、MCA 和 PCI 总线受到更多限制,并且正在从市场上消失。VLB 不存在特殊的内核支持。然而,Linux 2.0 中的 Lance 以太网驱动程序和 IDE 磁盘驱动程序都可以处理其设备的 VLB 版本。

The VESA bus is much more limited in its capabilities than the EISA, MCA, and PCI buses and is disappearing from the market. No special kernel support exists for VLB. However, both the Lance Ethernet driver and the IDE disk driver in Linux 2.0 can deal with VLB versions of their devices.

SBus

SBus

虽然现在大多数计算机都配备 PCI 或 ISA 接口总线,但大多数老式的基于 SPARC 的工作站使用 SBus 来连接其外设。

While most computers nowadays are equipped with a PCI or ISA interface bus, most older SPARC-based workstations use SBus to connect their peripherals.

SBus 是一种相当先进的设计,尽管它已经存在很长时间了。它是独立于处理器的(即使只有 SPARC 计算机使用它)并且针对 I/O 外围板进行了优化。换句话说,您无法将额外的 RAM 插入 SBus 插槽(即使在 ISA 世界中,RAM 扩展板也早已被遗忘,而且 PCI 也不支持它们)。这种优化旨在简化硬件设备和系统软件的设计,但会增加主板的复杂性。

SBus is quite an advanced design, although it has been around for a long time. It is meant to be processor independent (even though only SPARC computers use it) and is optimized for I/O peripheral boards. In other words, you can't plug additional RAM into SBus slots (RAM expansion boards have long been forgotten even in the ISA world, and PCI does not support them either). This optimization is meant to simplify the design of both hardware devices and system software, at the expense of some additional complexity in the motherboard.

总线的这种 I/O 偏向使外设可以使用虚拟地址来传输数据,从而无需分配连续的 DMA 缓冲区。主板负责解码虚拟地址并将其映射到物理地址。这需要在总线上附加一个 MMU(内存管理单元);负责该任务的芯片组称为 IOMMU。虽然这在某种程度上比在接口总线上使用物理地址更复杂,但由于 SPARC 处理器在设计上始终将 MMU 内核与 CPU 内核分开(无论是物理上还是至少在概念上),这一设计被大大简化了。实际上,这种设计选择也为其他智能处理器设计所共有,并且总体上是有益的。该总线的另一个特点是设备板利用大量的地理寻址,因此无需在每个外设中实现地址解码器,也无需处理地址冲突。

This I/O bias of the bus results in peripherals using virtual addresses to transfer data, thus bypassing the need to allocate a contiguous DMA buffer. The motherboard is responsible for decoding the virtual addresses and mapping them to physical addresses. This requires attaching an MMU (memory management unit) to the bus; the chipset in charge of the task is called IOMMU. Although somehow more complex than using physical addresses on the interface bus, this design is greatly simplified by the fact that SPARC processors have always been designed by keeping the MMU core separate from the CPU core (either physically or at least conceptually). Actually, this design choice is shared by other smart processor designs and is beneficial overall. Another feature of this bus is that device boards exploit massive geographical addressing, so there's no need to implement an address decoder in every peripheral or to deal with address conflicts.

SBus 外设在其 PROM 中使用 Forth 语言来初始化自身。选择 Forth 是因为该解释器是轻量级的,因此可以在任何计算机系统的固件中轻松实现。此外,SBus 规范概述了启动过程,以便兼容的 I/O 设备能够轻松地融入系统并在系统启动时被识别。这是支持多平台设备的重要一步;这与我们习惯的以 PC 为中心的 ISA 完全不同。然而,由于多种商业原因,它并没有成功。

SBus peripherals use the Forth language in their PROMs to initialize themselves. Forth was chosen because the interpreter is lightweight and, therefore, can be easily implemented in the firmware of any computer system. In addition, the SBus specification outlines the boot process, so that compliant I/O devices fit easily into the system and are recognized at system boot. This was a great step to support multi-platform devices; it's a completely different world from the PC-centric ISA stuff we were used to. However, it didn't succeed for a variety of commercial reasons.

尽管当前的内核版本为 SBus 设备提供了相当完整的支持,但如今该总线已很少使用,因此不值得在此详细介绍。有兴趣的读者可以查看 arch/sparc/kernel 和 arch/sparc/mm 中的源文件。

Although current kernel versions offer quite full-featured support for SBus devices, the bus is used so little nowadays that it's not worth covering in detail here. Interested readers can look at source files in arch/sparc/kernel and arch/sparc/mm.

NuBus

NuBus

另一个有趣但几乎被遗忘的接口总线是 NuBus。它存在于较旧的 Mac 计算机(具有 M68k 系列 CPU 的计算机)上。

Another interesting, but nearly forgotten, interface bus is NuBus. It is found on older Mac computers (those with the M68k family of CPUs).

整个总线都是内存映射的(就像 M68k 的所有东西一样),并且设备仅按地理位置寻址。这很好,也是 Apple 的典型做法,因为更老的 Apple II 已经有类似的总线布局。糟糕的是,几乎不可能找到有关 NuBus 的文档,因为 Apple 对其 Mac 计算机始终遵循封闭一切的政策(与之前的 Apple II 不同,后者的源代码和原理图只需很少的成本即可获得)。

All of the bus is memory-mapped (like everything with the M68k), and the devices are only geographically addressed. This is good and typical of Apple, as the much older Apple II already had a similar bus layout. What is bad is that it's almost impossible to find documentation on NuBus, due to the close-everything policy Apple has always followed with its Mac computers (and unlike the previous Apple II, whose source code and schematics were available at little cost).

文件drivers/nubus/nubus.c几乎包含了我们所知道的有关该总线的所有内容,读起来很有趣;它显示了开发人员必须做多少困难的逆向工程。

The file drivers/nubus/nubus.c includes almost everything we know about this bus, and it's interesting reading; it shows how much hard reverse engineering developers had to do.

外部总线

External Buses

接口总线领域最新的条目之一是整个外部总线类别。这包括 USB、FireWire 和 IEEE1284(基于并行端口的外部总线)。这些接口有点类似于较旧的、并非如此"外部"的技术,例如 PCMCIA/CardBus 甚至 SCSI。

One of the most recent entries in the field of interface buses is the whole class of external buses. This includes USB, FireWire, and IEEE1284 (parallel-port-based external bus). These interfaces are somewhat similar to older and not-so-external technology, such as PCMCIA/CardBus and even SCSI.

从概念上讲,这些总线既不是全功能的接口总线(如 PCI),也不是哑通信通道(如串行端口)。很难对利用其功能所需的软件进行分类,因为它通常分为两个层次:硬件控制器的驱动程序(例如第 12.1 节中介绍的 PCI SCSI 适配器或 PCI 控制器的驱动程序)和特定"客户端"设备的驱动程序(例如 sd.c 处理通用 SCSI 磁盘,而所谓的 PCI 驱动程序处理插入总线的卡)。

Conceptually, these buses are neither full-featured interface buses (like PCI is) nor dumb communication channels (like the serial ports are). It's hard to classify the software that is needed to exploit their features, as it's usually split into two levels: the driver for the hardware controller (like drivers for PCI SCSI adaptors or PCI controllers introduced in the Section 12.1) and the driver for the specific "client" device (like sd.c handles generic SCSI disks and so-called PCI drivers deal with cards plugged in the bus).

快速参考

Quick Reference

本节总结了本章中介绍的符号:

This section summarizes the symbols introduced in the chapter:

#include <linux/pci.h>
#include <linux/pci.h>

包含 PCI 寄存器的符号名称以及多个供应商和设备 ID 值的头文件。

Header that includes symbolic names for the PCI registers and several vendor and device ID values.

struct pci_dev;
struct pci_dev;

表示内核中 PCI 设备的结构。

Structure that represents a PCI device within the kernel.

struct pci_driver;
struct pci_driver;

表示 PCI 驱动程序的结构。所有 PCI 驱动程序都必须定义它。

Structure that represents a PCI driver. All PCI drivers must define this.

struct pci_device_id;
struct pci_device_id;

描述该驱动程序支持的 PCI 设备类型的结构。

Structure that describes the types of PCI devices this driver supports.

int pci_register_driver(struct pci_driver *drv);

int pci_module_init(struct pci_driver *drv);

void pci_unregister_driver(struct pci_driver *drv);
int pci_register_driver(struct pci_driver *drv);

int pci_module_init(struct pci_driver *drv);

void pci_unregister_driver(struct pci_driver *drv);

从内核注册或取消注册 PCI 驱动程序的函数。

Functions that register or unregister a PCI driver from the kernel.

struct pci_dev *pci_find_device(unsigned int vendor, unsigned int device,

struct pci_dev *from);

struct pci_dev *pci_find_device_reverse(unsigned int vendor, unsigned int

device, const struct pci_dev *from);

struct pci_dev *pci_find_subsys (unsigned int vendor, unsigned int device,

unsigned int ss_vendor, unsigned int ss_device, const struct pci_dev *from);

struct pci_dev *pci_find_class(unsigned int class, struct pci_dev *from);
struct pci_dev *pci_find_device(unsigned int vendor, unsigned int device,

struct pci_dev *from);

struct pci_dev *pci_find_device_reverse(unsigned int vendor, unsigned int

device, const struct pci_dev *from);

struct pci_dev *pci_find_subsys (unsigned int vendor, unsigned int device,

unsigned int ss_vendor, unsigned int ss_device, const struct pci_dev *from);

struct pci_dev *pci_find_class(unsigned int class, struct pci_dev *from);

在设备列表中搜索具有特定签名或属于特定类别的设备的函数。如果没有找到,返回值为 NULL。from 用于继续搜索;第一次调用任一函数时它必须为 NULL,如果您要搜索更多设备,它必须指向刚刚找到的设备。不建议使用这些函数,请改用 pci_get_ 变体。

Functions that search the device list for devices with a specific signature or those belonging to a specific class. The return value is NULL if none is found. from is used to continue a search; it must be NULL the first time you call either function, and it must point to the device just found if you are searching for more devices. These functions are not recommended; use the pci_get_ variants instead.

struct pci_dev *pci_get_device(unsigned int vendor, unsigned int device,

struct pci_dev *from);

struct pci_dev *pci_get_subsys(unsigned int vendor, unsigned int device,

unsigned int ss_vendor, unsigned int ss_device, struct pci_dev *from);

struct pci_dev *pci_get_slot(struct pci_bus *bus, unsigned int devfn);
struct pci_dev *pci_get_device(unsigned int vendor, unsigned int device,

struct pci_dev *from);

struct pci_dev *pci_get_subsys(unsigned int vendor, unsigned int device,

unsigned int ss_vendor, unsigned int ss_device, struct pci_dev *from);

struct pci_dev *pci_get_slot(struct pci_bus *bus, unsigned int devfn);

在设备列表中搜索具有特定签名或属于特定类别的设备的函数。如果没有找到,返回值为 NULL。from 用于继续搜索;第一次调用任一函数时它必须为 NULL,如果您要搜索更多设备,它必须指向刚刚找到的设备。返回的结构体的引用计数会递增;调用者使用完该结构体后,必须调用函数 pci_dev_put。

Functions that search the device list for devices with a specific signature or belonging to a specific class. The return value is NULL if none is found. from is used to continue a search; it must be NULL the first time you call either function, and it must point to the device just found if you are searching for more devices. The structure returned has its reference count incremented, and after the caller is finished with it, the function pci_dev_put must be called.

int pci_read_config_byte(struct pci_dev *dev, int where, u8 *val);

int pci_read_config_word(struct pci_dev *dev, int where, u16 *val);

int pci_read_config_dword(struct pci_dev *dev, int where, u32 *val);

int pci_write_config_byte(struct pci_dev *dev, int where, u8 val);

int pci_write_config_word(struct pci_dev *dev, int where, u16 val);

int pci_write_config_dword(struct pci_dev *dev, int where, u32 val);
int pci_read_config_byte(struct pci_dev *dev, int where, u8 *val);

int pci_read_config_word(struct pci_dev *dev, int where, u16 *val);

int pci_read_config_dword(struct pci_dev *dev, int where, u32 *val);

int pci_write_config_byte(struct pci_dev *dev, int where, u8 val);

int pci_write_config_word(struct pci_dev *dev, int where, u16 val);

int pci_write_config_dword(struct pci_dev *dev, int where, u32 val);

读取或写入 PCI 配置寄存器的函数。尽管 Linux 内核负责字节序处理,但程序员在从各个字节组装多字节值时必须小心字节序。PCI 总线是小端字节序的。

Functions that read or write a PCI configuration register. Although the Linux kernel takes care of byte ordering, the programmer must be careful about byte ordering when assembling multibyte values from individual bytes. The PCI bus is little-endian.

int pci_enable_device(struct pci_dev *dev);
int pci_enable_device(struct pci_dev *dev);

启用 PCI 设备。

Enables a PCI device.

unsigned long pci_resource_start(struct pci_dev *dev, int bar);

unsigned long pci_resource_end(struct pci_dev *dev, int bar);

unsigned long pci_resource_flags(struct pci_dev *dev, int bar);
unsigned long pci_resource_start(struct pci_dev *dev, int bar);

unsigned long pci_resource_end(struct pci_dev *dev, int bar);

unsigned long pci_resource_flags(struct pci_dev *dev, int bar);

处理 PCI 设备资源的函数。

Functions that handle PCI device resources.




[ 1 ]某些体系结构还在/proc/pci/proc/bus/pci文件中显示 PCI 域信息 。

[1] Some architectures also display the PCI domain information in the /proc/pci and /proc/bus/pci files.

[ 2 ]实际上,该配置并不限于系统启动的时间;例如,热插拔设备在启动时不可用,而是稍后出现。这里的要点是设备驱动程序不得更改 I/O 或内存区域的地址。

[2] Actually, that configuration is not restricted to the time the system boots; hotpluggable devices, for example, cannot be available at boot time and appear later instead. The main point here is that the device driver must not change the address of I/O or memory regions.

[ 3 ]您可以在任何设备的硬件手册中找到其 ID。文件pci.ids中包含一个列表,它是 pciutils包和内核源代码的一部分;它并不假装完整,只是列出了最著名的供应商和设备。该文件的内核版本将不会包含在未来的内核系列中。

[3] You'll find the ID of any device in its own hardware manual. A list is included in the file pci.ids, part of the pciutils package and the kernel sources; it doesn't pretend to be complete but just lists the most renowned vendors and devices. The kernel version of this file will not be included in future kernel series.

[ 4 ]该信息位于 PCI 基址寄存器的低位之一中。这些位在<linux/pci.h>中定义。

[4] The information lives in one of the low-order bits of the base address PCI registers. The bits are defined in <linux/pci.h>.

[ 5 ]中断共享的问题是电气工程的问题:如果设备通过应用低阻抗电压电平将信号线驱动为非活动状态,则中断无法共享。另一方面,如果器件使用上拉电阻来连接无效逻辑电平,则可以实现共享。这是当今的常态。然而,由于 ISA 中断是边沿触发而不是电平触发,因此仍然存在丢失中断事件的潜在风险。边沿触发的中断更容易在硬件中实现,但不适合安全共享。

[5] The problem with interrupt sharing is a matter of electrical engineering: if a device drives the signal line inactive—by applying a low-impedance voltage level—the interrupt can't be shared. If, on the other hand, the device uses a pull-up resistor to the inactive logic level, sharing is possible. This is the norm nowadays. However, there's still a potential risk of losing interrupt events since ISA interrupts are edge triggered instead of level triggered. Edge-triggered interrupts are easier to implement in hardware but don't lend themselves to safe sharing.

第 13 章 USB 驱动程序

Chapter 13. USB Drivers

通用串行总线 (USB) 是主计算机和许多外围设备之间的一种连接。它最初的创建是为了用所有设备都可以连接的单一总线类型来取代各种缓慢且互不相同的总线(并行、串行和键盘连接)。[ 1 ] USB 已经超越了这些缓慢的连接,现在支持几乎所有类型的可以连接到 PC 的设备。USB 规范的最新版本增加了高速连接,理论速度上限为 480 Mbps。

The universal serial bus (USB) is a connection between a host computer and a number of peripheral devices. It was originally created to replace a wide range of slow and different buses—the parallel, serial, and keyboard connections—with a single bus type that all devices could connect to.[1] USB has grown beyond these slow connections and now supports almost every type of device that can be connected to a PC. The latest revision of the USB specification added high-speed connections with a theoretical speed limit of 480 Mbps.

从拓扑上来说,USB 子系统并不是以总线的方式布局的;它更像是一棵由多个点对点链接构建而成的树。这些链接是连接设备和集线器的四线电缆(地线、电源线和两根信号线),就像双绞线以太网一样。USB 主机控制器负责询问每个 USB 设备是否有数据要发送。由于这种拓扑结构,USB 设备在主机控制器发出请求之前永远无法开始发送数据。这种配置允许非常简单的即插即用类型的系统,其中设备可以由主计算机自动配置。

Topologically, a USB subsystem is not laid out as a bus; it is rather a tree built out of several point-to-point links. The links are four-wire cables (ground, power, and two signal wires) that connect a device and a hub, just like twisted-pair Ethernet. The USB host controller is in charge of asking every USB device if it has any data to send. Because of this topology, a USB device can never start sending data without first being asked to by the host controller. This configuration allows for a very easy plug-and-play type of system, whereby devices can be automatically configured by the host computer.

该总线在技术层面上非常简单,因为它是单主机实现,由主计算机轮询各种外围设备。尽管存在这种固有的限制,总线仍具有一些有趣的功能,例如设备能够为其数据传输请求固定带宽,以便可靠地支持视频和音频 I/O。USB 的另一个重要特征是,它仅充当设备和主机之间的通信通道,而不要求其传送的数据具有特定的含义或结构。[ 2 ]

The bus is very simple at the technological level, as it's a single-master implementation in which the host computer polls the various peripheral devices. Despite this intrinsic limitation, the bus has some interesting features, such as the ability for a device to request a fixed bandwidth for its data transfers in order to reliably support video and audio I/O. Another important feature of USB is that it acts merely as a communication channel between the device and the host, without requiring specific meaning or structure to the data it delivers.[2]

USB 协议规范定义了一组任何特定类型的设备都可以遵循的标准。如果设备遵循该标准,则不需要该设备的特殊驱动程序。这些不同的类型称为类,由存储设备、键盘、鼠标、操纵杆、网络设备和调制解调器等组成。不属于这些类别的其他类型的设备需要为该特定设备编写特殊的供应商特定驱动程序。视频设备和 USB 转串口设备就是一个很好的例子,它们没有定义的标准,并且不同制造商的每种不同设备都需要一个驱动程序。

The USB protocol specifications define a set of standards that any device of a specific type can follow. If a device follows that standard, then a special driver for that device is not necessary. These different types are called classes and consist of things like storage devices, keyboards, mice, joysticks, network devices, and modems. Other types of devices that do not fit into these classes require a special vendor-specific driver to be written for that specific device. Video devices and USB-to-serial devices are a good example where there is no defined standard, and a driver is needed for every different device from different manufacturers.

这些功能加上设计固有的热插拔能力,使 USB 成为一种方便、低成本的机制,可以将多个设备连接到计算机(以及断开连接),而无需关闭系统、打开机盖并对着螺丝和电线咒骂。

These features, together with the inherent hotplug capability of the design, make USB a handy, low-cost mechanism to connect (and disconnect) several devices to the computer without the need to shut the system down, open the cover, and swear over screws and wires.

Linux 内核支持两种主要类型的 USB 驱动程序:主机系统上的驱动程序和设备上的驱动程序。主机系统的 USB 驱动程序从主机的角度控制插入其中的 USB 设备(常见的 USB 主机是台式计算机)。设备中的 USB 驱动程序控制该设备作为 USB 设备在主计算机看来的样子。由于术语"USB 设备驱动程序"非常容易混淆,USB 开发人员创建了术语"USB 小工具驱动程序"来描述控制连接到计算机的 USB 设备的驱动程序(请记住,Linux 也运行在那些微型嵌入式设备中)。本章详细介绍在台式计算机上运行的 USB 系统的工作原理。USB 小工具驱动程序目前不在本书的讨论范围之内。

The Linux kernel supports two main types of USB drivers: drivers on a host system and drivers on a device. The USB drivers for a host system control the USB devices that are plugged into it, from the host's point of view (a common USB host is a desktop computer). The USB drivers in a device control how that single device looks to the host computer as a USB device. As the term "USB device drivers" is very confusing, the USB developers have created the term "USB gadget drivers" to describe the drivers that control a USB device that connects to a computer (remember that Linux also runs in those tiny embedded devices, too). This chapter details how the USB system that runs on a desktop computer works. USB gadget drivers are outside the realm of this book at this point in time.

如图 13-1 所示,USB 驱动程序位于不同的内核子系统(块、网络、字符等)和 USB 硬件控制器之间。USB 核心为 USB 驱动程序提供了一个接口,用于访问和控制 USB 硬件,而不必担心系统上存在的不同类型的 USB 硬件控制器。

As Figure 13-1 shows, USB drivers live between the different kernel subsystems (block, net, char, etc.) and the USB hardware controllers. The USB core provides an interface for USB drivers to use to access and control the USB hardware, without having to worry about the different types of USB hardware controllers that are present on the system.

USB 驱动程序概述

图 13-1。USB 驱动程序概述

Figure 13-1. USB driver overview

USB 设备基础知识

USB Device Basics

USB 设备是一个非常复杂的东西,如官方 USB 文档(可在 http://www.usb.org 上获取)中所述。幸运的是,Linux 内核提供了一个称为 USB 核心的子系统来处理大部分复杂性。本章描述驱动程序和 USB 核心之间的交互。图 13-2 显示了 USB 设备如何由配置、接口和端点组成,以及 USB 驱动程序如何绑定到 USB 接口而不是整个 USB 设备。

A USB device is a very complex thing, as described in the official USB documentation (available at http://www.usb.org). Fortunately, the Linux kernel provides a subsystem called the USB core to handle most of the complexity. This chapter describes the interaction between a driver and the USB core. Figure 13-2 shows how USB devices consist of configurations, interfaces, and endpoints and how USB drivers bind to USB interfaces, not the entire USB device.

USB 设备概述

图 13-2。USB 设备概述

Figure 13-2. USB device overview

端点

Endpoints

USB 通信的最基本形式是通过称为端点的东西进行的。USB 端点只能在一个方向上传输数据,要么从主计算机到设备(称为 OUT 端点),要么从设备到主计算机(称为 IN 端点)。端点可以被认为是单向管道。

The most basic form of USB communication is through something called an endpoint. A USB endpoint can carry data in only one direction, either from the host computer to the device (called an OUT endpoint) or from the device to the host computer (called an IN endpoint). Endpoints can be thought of as unidirectional pipes.

USB 端点可以是描述数据传输方式的四种不同类型之一:

A USB endpoint can be one of four different types that describe how the data is transmitted:

控制
CONTROL

控制端点用于允许访问 USB 设备的不同部分。它们通常用于配置设备、检索有关设备的信息、向设备发送命令或检索有关设备的状态报告。这些端点的尺寸通常很小。每个 USB 设备都有一个称为“端点 0”的控制端点,USB 核心使用该端点在插入时配置设备。USB 协议保证这些传输始终有足够的保留带宽来传输到设备。

Control endpoints are used to allow access to different parts of the USB device. They are commonly used for configuring the device, retrieving information about the device, sending commands to the device, or retrieving status reports about the device. These endpoints are usually small in size. Every USB device has a control endpoint called "endpoint 0" that is used by the USB core to configure the device at insertion time. These transfers are guaranteed by the USB protocol to always have enough reserved bandwidth to make it through to the device.

中断
INTERRUPT

每次 USB 主机向设备请求数据时,中断端点都会以固定速率传输少量数据。这些端点是 USB 键盘和鼠标的主要传输方法。它们也常用于向 USB 设备发送数据以控制设备,但通常不用于传输大量数据。USB 协议保证这些传输始终有足够的预留带宽来完成。

Interrupt endpoints transfer small amounts of data at a fixed rate every time the USB host asks the device for data. These endpoints are the primary transport method for USB keyboards and mice. They are also commonly used to send data to USB devices to control the device, but are not generally used to transfer large amounts of data. These transfers are guaranteed by the USB protocol to always have enough reserved bandwidth to make it through.

批量
BULK

批量端点传输大量数据。这些端点通常比中断端点大得多(它们一次可以容纳更多字符)。对于需要在不丢失数据的前提下完成传输的设备来说,它们很常见。USB 协议不保证这些传输总能在特定时间内完成。如果总线上没有足够的空间来发送整个批量数据包,它会被分成多次往返设备的传输。这些端点在打印机、存储和网络设备上很常见。

Bulk endpoints transfer large amounts of data. These endpoints are usually much larger (they can hold more characters at once) than interrupt endpoints. They are common for devices that need to transfer any data that must get through with no data loss. These transfers are not guaranteed by the USB protocol to always make it through in a specific amount of time. If there is not enough room on the bus to send the whole BULK packet, it is split up across multiple transfers to or from the device. These endpoints are common on printers, storage, and network devices.

等时
ISOCHRONOUS

等时端点也传输大量数据,但并不总是保证数据能够通过。这些端点用于可以处理数据丢失的设备,并且更多地依赖于保持恒定的数据流。实时数据收集,例如音频和视频设备,几乎总是使用这些端点。

Isochronous endpoints also transfer large amounts of data, but the data is not always guaranteed to make it through. These endpoints are used in devices that can handle loss of data, and rely more on keeping a constant stream of data flowing. Real-time data collections, such as audio and video devices, almost always use these endpoints.

只要驱动程序决定使用它们,控制端点和批量端点就可用于异步数据传输。中断和等时端点是周期性的。这意味着这些端点被设置为在固定时间连续传输数据,这导致 USB 核心为它们保留带宽。

Control and bulk endpoints are used for asynchronous data transfers, whenever the driver decides to use them. Interrupt and isochronous endpoints are periodic. This means that these endpoints are set up to transfer data at fixed times continuously, which causes their bandwidth to be reserved by the USB core.

USB 端点在内核中用 struct usb_host_endpoint 结构体来描述。该结构在另一个名为 struct usb_endpoint_descriptor 的结构中包含真实的端点信息。后一种结构包含所有 USB 特定数据,其格式与设备本身指定的格式完全相同。驱动程序关心的该结构体的字段有:

USB endpoints are described in the kernel with the structure struct usb_host_endpoint. This structure contains the real endpoint information in another structure called struct usb_endpoint_descriptor. The latter structure contains all of the USB-specific data in the exact format that the device itself specified. The fields of this structure that drivers care about are:

bEndpointAddress
bEndpointAddress

这是该特定端点的 USB 地址。该 8 位值中还包含端点的方向。可以将位掩码 USB_DIR_OUT 和 USB_DIR_IN 与该字段进行比对,以确定该端点的数据是流向设备还是流向主机。

This is the USB address of this specific endpoint. Also included in this 8-bit value is the direction of the endpoint. The bitmasks USB_DIR_OUT and USB_DIR_IN can be placed against this field to determine if the data for this endpoint is directed to the device or to the host.

bmAttributes
bmAttributes

这是端点的类型。应将位掩码 USB_ENDPOINT_XFERTYPE_MASK 与该值进行比对,以确定端点的类型是 USB_ENDPOINT_XFER_ISOC、USB_ENDPOINT_XFER_BULK 还是 USB_ENDPOINT_XFER_INT。这些宏分别定义等时端点、批量端点和中断端点。

This is the type of endpoint. The bitmask USB_ENDPOINT_XFERTYPE_MASK should be placed against this value in order to determine if the endpoint is of type USB_ENDPOINT_XFER_ISOC, USB_ENDPOINT_XFER_BULK, or of type USB_ENDPOINT_XFER_INT. These macros define an isochronous, bulk, and interrupt endpoint, respectively.

wMaxPacketSize
wMaxPacketSize

这是该端点一次可以处理的最大大小(以字节为单位)。请注意,驱动程序可以向端点发送大于此值的数据量,但在实际传输到设备时,数据会被拆分成大小为 wMaxPacketSize 的块。对于高速设备,该字段可以通过使用值上部的一些额外位来支持端点的高带宽模式。有关如何完成此操作的更多详细信息,请参阅 USB 规范。

This is the maximum size in bytes that this endpoint can handle at once. Note that it is possible for a driver to send amounts of data to an endpoint that is bigger than this value, but the data will be divided up into wMaxPacketSize chunks when actually transmitted to the device. For high-speed devices, this field can be used to support a high-bandwidth mode for the endpoint by using a few extra bits in the upper part of the value. See the USB specification for more details about how this is done.

bInterval
bInterval

如果该端点是中断类型,则该值是端点的间隔设置,即端点的中断请求之间的时间。该值以毫秒表示。

If this endpoint is of type interrupt, this value is the interval setting for the endpoint—that is, the time between interrupt requests for the endpoint. The value is represented in milliseconds.

该结构的字段没有“传统”Linux 内核命名方案。这是因为这些字段直接对应于 USB 规范中的字段名称。USB 内核程序员认为使用指定的名称比使用 Linux 程序员熟悉的变量名称更重要,这样可以减少阅读规范时的混乱。

The fields of this structure do not have a "traditional" Linux kernel naming scheme. This is because these fields directly correspond to the field names in the USB specification. The USB kernel programmers felt that it was more important to use the specified names, so as to reduce confusion when reading the specification, than it was to have variable names that look familiar to Linux programmers.

接口

Interfaces

USB 端点被捆绑成接口。USB 接口只处理一种类型的 USB 逻辑连接,例如鼠标、键盘或音频流。某些 USB 设备具有多个接口,例如 USB 扬声器可能包含两个接口:用于按钮的 USB 键盘和 USB 音频流。由于 USB 接口代表基本功能,每个 USB 驱动程序控制一个接口;因此,对于扬声器的例子,Linux 需要为一个硬件设备提供两个不同的驱动程序。

USB endpoints are bundled up into interfaces. USB interfaces handle only one type of a USB logical connection, such as a mouse, a keyboard, or an audio stream. Some USB devices have multiple interfaces, such as a USB speaker that might consist of two interfaces: a USB keyboard for the buttons and a USB audio stream. Because a USB interface represents basic functionality, each USB driver controls an interface; so, for the speaker example, Linux needs two different drivers for one hardware device.

USB 接口可以具有替代设置,这是接口参数的不同选择。接口的初始状态是第一个设置,编号为 0。备用设置可用于以不同方式控制各个端点,例如为设备保留不同数量的 USB 带宽。每个具有同步端点的设备都使用同一接口的备用设置。

USB interfaces may have alternate settings, which are different choices for parameters of the interface. The initial state of an interface is in the first setting, numbered 0. Alternate settings can be used to control individual endpoints in different ways, such as to reserve different amounts of USB bandwidth for the device. Each device with an isochronous endpoint uses alternate settings for the same interface.

USB接口在内核中用struct usb_interface结构体进行了描述。该结构是 USB 核心传递给 USB 驱动程序的结构,也是 USB 驱动程序负责控制的结构。该结构中的重要字段是:

USB interfaces are described in the kernel with the struct usb_interface structure. This structure is what the USB core passes to USB drivers and is what the USB driver then is in charge of controlling. The important fields in this structure are:

struct usb_host_interface *altsetting
struct usb_host_interface *altsetting

包含可以为此接口选择的所有备用设置的接口结构数组。每个 struct usb_host_interface 由一组端点配置组成,这些配置由前面描述的 struct usb_host_endpoint 结构定义。请注意,这些接口结构没有特定的顺序。

An array of interface structures containing all of the alternate settings that may be selected for this interface. Each struct usb_host_interface consists of a set of endpoint configurations as defined by the struct usb_host_endpoint structure described above. Note that these interface structures are in no particular order.

unsigned num_altsetting
unsigned num_altsetting

altsetting 指针指向的备用设置的数量。

The number of alternate settings pointed to by the altsetting pointer.

struct usb_host_interface *cur_altsetting
struct usb_host_interface *cur_altsetting

指向 altsetting 数组中某一项的指针,表示该接口当前活动的设置。

A pointer into the array altsetting, denoting the currently active setting for this interface.

int minor
int minor

如果绑定到该接口的 USB 驱动程序使用 USB 主设备号,则该变量包含 USB 核心分配给该接口的次设备号。这仅在成功调用 usb_register_dev(本章稍后介绍)之后才有效。

If the USB driver bound to this interface uses the USB major number, this variable contains the minor number assigned by the USB core to the interface. This is valid only after a successful call to usb_register_dev (described later in this chapter).

struct usb_interface 结构中还有其他字段,但 USB 驱动程序不需要了解它们。

There are other fields in the struct usb_interface structure, but USB drivers do not need to be aware of them.

配置

Configurations

USB 接口本身被捆绑成配置。USB 设备可以有多种配置,并且可以在它们之间切换以改变设备的状态。例如,一些允许向其下载固件的设备包含多种配置来实现这一点。同一时刻只能启用一种配置。Linux 对多配置 USB 设备的处理不是很好,但幸运的是,这类设备很少见。

USB interfaces are themselves bundled up into configurations. A USB device can have multiple configurations and might switch between them in order to change the state of the device. For example, some devices that allow firmware to be downloaded to them contain multiple configurations to accomplish this. A single configuration can be enabled only at one point in time. Linux does not handle multiple configuration USB devices very well, but, thankfully, they are rare.

Linux 用 struct usb_host_config 结构描述 USB 配置,用 struct usb_device 结构描述整个 USB 设备。USB 设备驱动程序通常不需要读写这些结构中的任何值,因此这里不详细定义它们。好奇的读者可以在内核源代码树的 include/linux/usb.h 文件中找到它们的描述。

Linux describes USB configurations with the structure struct usb_host_config and entire USB devices with the structure struct usb_device. USB device drivers do not generally ever need to read or write to any values in these structures, so they are not defined in detail here. The curious reader can find descriptions of them in the file include/linux/usb.h in the kernel source tree.

USB 设备驱动程序通常必须把给定的 struct usb_interface 结构中的数据转换为 struct usb_device 结构,USB 核心的各种函数调用需要后者。为此,提供了 interface_to_usbdev 函数。希望将来所有当前需要 struct usb_device 的 USB 调用都会改为接受 struct usb_interface 参数,从而不再需要驱动程序进行这种转换。

A USB device driver commonly has to convert data from a given struct usb_interface structure into a struct usb_device structure that the USB core needs for a wide range of function calls. To do this, the function interface_to_usbdev is provided. Hopefully, in the future, all USB calls that currently need a struct usb_device will be converted to take a struct usb_interface parameter and will not require the drivers to do the conversion.

总而言之,USB 设备非常复杂,由许多不同的逻辑单元组成。这些单元之间的关系可以简单描述如下:

So to summarize, USB devices are quite complex and are made up of lots of different logical units. The relationships among these units can be simply described as follows:

  • 设备通常具有一种或多种配置。

  • Devices usually have one or more configurations.

  • 配置通常具有一个或多个接口。

  • Configurations often have one or more interfaces.

  • 接口通常具有一项或多项设置。

  • Interfaces usually have one or more settings.

  • 接口有零个或多个端点。

  • Interfaces have zero or more endpoints.

USB 和 Sysfs

USB and Sysfs

由于单个 USB 物理设备的复杂性,该设备在 sysfs 中的表示也相当复杂。物理 USB 设备(由 struct usb_device 表示)和各个 USB 接口(由 struct usb_interface 表示)在 sysfs 中都显示为单独的设备。(这是因为这两个结构都包含一个 struct device 结构。)例如,对于只包含一个 USB 接口的简单 USB 鼠标,以下是该设备的 sysfs 目录树:

Due to the complexity of a single USB physical device, the representation of that device in sysfs is also quite complex. Both the physical USB device (as represented by a struct usb_device) and the individual USB interfaces (as represented by a struct usb_interface) are shown in sysfs as individual devices. (This is because both of those structures contain a struct device structure.) As an example, for a simple USB mouse that contains only one USB interface, the following would be the sysfs directory tree for that device:

/sys/devices/pci0000:00/0000:00:09.0/usb2/2-1
|-- 2-1:1.0
|   |-- bAlternateSetting
|   |-- bInterfaceClass
|   |-- bInterfaceNumber
|   |-- bInterfaceProtocol
|   |-- bInterfaceSubClass
|   |-- bNumEndpoints
|   |-- detach_state
|   |-- iInterface
|   `-- power
|       `-- state
|-- bConfigurationValue
|-- bDeviceClass
|-- bDeviceProtocol
|-- bDeviceSubClass
|-- bMaxPower
|-- bNumConfigurations
|-- bNumInterfaces
|-- bcdDevice
|-- bmAttributes
|-- detach_state
|-- devnum
|-- idProduct
|-- idVendor
|-- maxchild
|-- power
|   `-- state
|-- speed
`-- version

struct usb_device在树中表示为:

The struct usb_device is represented in the tree at:

/sys/devices/pci0000:00/0000:00:09.0/usb2/2-1
/sys/devices/pci0000:00/0000:00:09.0/usb2/2-1

而鼠标的 USB 接口(USB 鼠标驱动程序绑定的接口)位于以下目录:

while the USB interface for the mouse—the interface that the USB mouse driver is bound to—is located at the directory:

/sys/devices/pci0000:00/0000:00:09.0/usb2/2-1/2-1:1.0
/sys/devices/pci0000:00/0000:00:09.0/usb2/2-1/2-1:1.0

为了帮助理解这个长设备路径的含义,我们描述了内核如何标记 USB 设备。

To help understand what this long device path means, we describe how the kernel labels the USB devices.

第一个 USB 设备是根集线器。这是 USB 控制器,通常包含在 PCI 设备中。该控制器之所以如此命名,是因为它控制与其连接的整个 USB 总线。该控制器是 PCI 总线和 USB 总线之间的桥梁,也是该总线上的第一个 USB 设备。

The first USB device is a root hub. This is the USB controller, usually contained in a PCI device. The controller is so named because it controls the whole USB bus connected to it. The controller is a bridge between the PCI bus and the USB bus, as well as being the first USB device on that bus.

USB 核心为所有根集线器分配唯一的编号。在我们的示例中,根集线器称为 usb2,因为它是向 USB 核心注册的第二个根集线器。单个系统在任何时候可以包含的根集线器数量没有限制。

All root hubs are assigned a unique number by the USB core. In our example, the root hub is called usb2, as it is the second root hub that was registered with the USB core. There is no limit on the number of root hubs that can be contained in a single system at any time.

USB 总线上的每个设备都把根集线器的编号作为其名称中的第一个数字。后面跟着一个 - 字符,然后是设备插入的端口号。由于示例中的设备插入第一个端口,名称中加上 1。因此主 USB 鼠标设备的设备名称是 2-1。由于该 USB 设备包含一个接口,sysfs 路径中会相应增加树中的另一个设备。USB 接口的命名方案是到此为止的设备名称:在我们的示例中是 2-1,后面跟一个冒号和 USB 配置编号,然后是一个句点和接口编号。因此对于本例,设备名称是 2-1:1.0,因为这是第一个配置,接口编号为零。

Every device that is on a USB bus takes the number of the root hub as the first number in its name. That is followed by a - character and then the number of the port that the device is plugged into. As the device in our example is plugged into the first port, a 1 is added to the name. So the device name for the main USB mouse device is 2-1. Because this USB device contains one interface, that causes another device in the tree to be added to the sysfs path. The naming scheme for USB interfaces is the device name up to this point: in our example, it's 2-1 followed by a colon and the USB configuration number, then a period and the interface number. So for this example, the device name is 2-1:1.0 because it is the first configuration and has interface number zero.

总而言之,USB sysfs 设备命名方案是:

So to summarize, the USB sysfs device naming scheme is:

            root_hub-hub_port:config.interface
            root_hub-hub_port:config.interface

随着设备在 USB 树中进一步向下移动,并且使用越来越多的 USB 集线器,集线器端口号将添加到链中前一个集线器端口号之后的字符串中。对于两层深度的树,设备名称如下所示:

As the devices go further down in the USB tree, and as more and more USB hubs are used, the hub port number is added to the string following the previous hub port number in the chain. For a two-deep tree, the device name looks like:

            root_hub-hub_port-hub_port:config.interface
            root_hub-hub_port-hub_port:config.interface

从前面的 USB 设备和接口的目录列表中可以看出,所有 USB 特定信息都可以直接通过 sysfs 获得(例如 idVendor、idProduct 和 bMaxPower 信息)。可以写入这些文件之一bConfigurationValue以更改正在使用的活动 USB 配置。当内核无法确定选择哪种配置才能正确操作设备时,这对于具有多种配置的设备非常有用。许多 USB 调制解调器需要将正确的配置值写入此文件,以便将正确的 USB 驱动程序绑定到设备。

As can be seen in the previous directory listing of the USB device and interface, all of the USB specific information is available directly through sysfs (for example, the idVendor, idProduct, and bMaxPower information). One of these files, bConfigurationValue, can be written to in order to change the active USB configuration that is being used. This is useful for devices that have multiple configurations, when the kernel is unable to determine what configuration to select in order to properly operate the device. A number of USB modems need to have the proper configuration value written to this file in order to have the correct USB driver bind to the device.

Sysfs 不会公开 USB 设备的所有不同部分,因为它停留在接口级别。设备可能包含的任何备用配置,以及与接口关联的端点的细节,都没有显示出来。这些信息可以在 usbfs 文件系统中找到,它挂载在系统的 /proc/bus/usb/ 目录中。文件 /proc/bus/usb/devices 显示 sysfs 中公开的所有相同信息,以及系统中所有 USB 设备的备用配置和端点信息。usbfs 还允许用户空间程序直接与 USB 设备通信,这使得许多内核驱动程序得以移到用户空间,在那里更容易维护和调试。USB 扫描仪驱动程序就是一个很好的例子:它不再存在于内核中,因为其功能现在包含在用户空间的 SANE 库程序中。

Sysfs does not expose all of the different parts of a USB device, as it stops at the interface level. Any alternate configurations that the device may contain are not shown, as well as the details of the endpoints associated with the interfaces. This information can be found in the usbfs filesystem, which is mounted in the /proc/bus/usb/ directory on the system. The file /proc/bus/usb/devices does show all of the same information exposed in sysfs, as well as the alternate configuration and endpoint information for all USB devices that are present in the system. usbfs also allows user-space programs to directly talk to USB devices, which has enabled a lot of kernel drivers to be moved out to user space, where it is easier to maintain and debug. The USB scanner driver is a good example of this, as it is no longer present in the kernel because its functionality is now contained in the user-space SANE library programs.

USB Urb

USB Urbs

Linux 内核中的 USB 代码使用一种称为 urb(USB 请求块)的东西与所有 USB 设备进行通信。该请求块用 struct urb 结构体描述,可以在 include/linux/usb.h 文件中找到。

The USB code in the Linux kernel communicates with all USB devices using something called a urb (USB request block). This request block is described with the struct urb structure and can be found in the include/linux/usb.h file.

urb 用于以异步方式向特定 USB 设备上的特定 USB 端点发送数据或从其接收数据。它的使用方式很像文件系统异步 I/O 代码中的 kiocb 结构,或者网络代码中的 struct skbuff。USB 设备驱动程序可以为单个端点分配多个 urb,也可以为许多不同的端点重用单个 urb,这取决于驱动程序的需要。设备中的每个端点都可以处理一个 urb 队列,因此在队列清空之前可以向同一端点发送多个 urb。一个 urb 的典型生命周期如下:

A urb is used to send or receive data to or from a specific USB endpoint on a specific USB device in an asynchronous manner. It is used much like a kiocb structure is used in the filesystem async I/O code or as a struct skbuff is used in the networking code. A USB device driver may allocate many urbs for a single endpoint or may reuse a single urb for many different endpoints, depending on the need of the driver. Every endpoint in a device can handle a queue of urbs, so that multiple urbs can be sent to the same endpoint before the queue is empty. The typical lifecycle of a urb is as follows:

  • 由 USB 设备驱动程序创建。

  • Created by a USB device driver.

  • 分配给特定 USB 设备的特定端点。

  • Assigned to a specific endpoint of a specific USB device.

  • 由USB设备驱动程序提交给USB核心。

  • Submitted to the USB core, by the USB device driver.

  • 由USB核心提交给指定设备的特定USB主机控制器驱动程序。

  • Submitted to the specific USB host controller driver for the specified device by the USB core.

  • 由 USB 主机控制器驱动程序处理,该驱动程序向设备进行 USB 传输。

  • Processed by the USB host controller driver that makes a USB transfer to the device.

  • 当urb完成时,USB主机控制器驱动程序通知USB设备驱动程序。

  • When the urb is completed, the USB host controller driver notifies the USB device driver.

Urb 也可以随时由提交 urb 的驱动程序取消,或者在设备从系统中删除时由 USB 核心取消。urb 是动态创建的,并包含一个内部引用计数,使它们能够在 urb 的最后一个用户释放它时自动释放。

Urbs can also be canceled any time by the driver that submitted the urb, or by the USB core if the device is removed from the system. urbs are dynamically created and contain an internal reference count that enables them to be automatically freed when the last user of the urb releases it.

本章中描述的处理 urb 的过程很有用,因为它允许流式传输和其他复杂的重叠通信,从而允许驱动程序实现尽可能高的数据传输速度。但是,如果您只想发送单独的批量或控制消息并且不关心数据吞吐率,则可以使用不太麻烦的过程。(参见第 13.5 节。)

The procedure described in this chapter for handling urbs is useful, because it permits streaming and other complex, overlapping communications that allow drivers to achieve the highest possible data transfer speeds. But less cumbersome procedures are available if you just want to send individual bulk or control messages and do not care about data throughput rates. (See the Section 13.5.)

struct urb 结构

struct urb

struct urb 结构中对 USB 设备驱动程序重要的字段有:

The fields of the struct urb structure that matter to a USB device driver are:

struct usb_device *dev
struct usb_device *dev

指向该 urb 要发送到的 struct usb_device 的指针。在 urb 被发送到 USB 核心之前,该变量必须由 USB 驱动程序初始化。

Pointer to the struct usb_device to which this urb is sent. This variable must be initialized by the USB driver before the urb can be sent to the USB core.

unsigned int pipe
unsigned int pipe

该 urb 要发送到的特定 struct usb_device 的端点信息。在 urb 被发送到 USB 核心之前,该变量必须由 USB 驱动程序初始化。

Endpoint information for the specific struct usb_device that this urb is to be sent to. This variable must be initialized by the USB driver before the urb can be sent to the USB core.

为了设置此结构的字段,驱动程序根据数据传输的方向适当地使用以下函数。请注意,每个端点只能属于一种类型。

To set fields of this structure, the driver uses the following functions as appropriate, depending on the direction of traffic. Note that every endpoint can be of only one type.

unsigned int usb_sndctrlpipe(struct usb_device *dev, unsigned int

endpoint)
unsigned int usb_sndctrlpipe(struct usb_device *dev, unsigned int

endpoint)

为具有指定端点编号的指定 USB 设备指定控制 OUT 端点。

Specifies a control OUT endpoint for the specified USB device with the specified endpoint number.

unsigned int usb_rcvctrlpipe(struct usb_device *dev, unsigned int

endpoint)
unsigned int usb_rcvctrlpipe(struct usb_device *dev, unsigned int

endpoint)

为具有指定端点编号的指定 USB 设备指定控制 IN 端点。

Specifies a control IN endpoint for the specified USB device with the specified endpoint number.

unsigned int usb_sndbulkpipe(struct usb_device *dev, unsigned int

endpoint)
unsigned int usb_sndbulkpipe(struct usb_device *dev, unsigned int

endpoint)

为具有指定端点编号的指定 USB 设备指定批量 OUT 端点。

Specifies a bulk OUT endpoint for the specified USB device with the specified endpoint number.

unsigned int usb_rcvbulkpipe(struct usb_device *dev, unsigned int

endpoint)
unsigned int usb_rcvbulkpipe(struct usb_device *dev, unsigned int

endpoint)

为具有指定端点编号的指定 USB 设备指定批量 IN 端点。

Specifies a bulk IN endpoint for the specified USB device with the specified endpoint number.

unsigned int usb_sndintpipe(struct usb_device *dev, unsigned int endpoint)
unsigned int usb_sndintpipe(struct usb_device *dev, unsigned int endpoint)

为具有指定端点号的指定 USB 设备指定中断 OUT 端点。

Specifies an interrupt OUT endpoint for the specified USB device with the specified endpoint number.

unsigned int usb_rcvintpipe(struct usb_device *dev, unsigned int endpoint)
unsigned int usb_rcvintpipe(struct usb_device *dev, unsigned int endpoint)

为具有指定端点号的指定 USB 设备指定中断 IN 端点。

Specifies an interrupt IN endpoint for the specified USB device with the specified endpoint number.

unsigned int usb_sndisocpipe(struct usb_device *dev, unsigned int

endpoint)
unsigned int usb_sndisocpipe(struct usb_device *dev, unsigned int

endpoint)

为具有指定端点号的指定 USB 设备指定同步 OUT 端点。

Specifies an isochronous OUT endpoint for the specified USB device with the specified endpoint number.

unsigned int usb_rcvisocpipe(struct usb_device *dev, unsigned int

endpoint)
unsigned int usb_rcvisocpipe(struct usb_device *dev, unsigned int

endpoint)

为具有指定端点号的指定 USB 设备指定同步 IN 端点。

Specifies an isochronous IN endpoint for the specified USB device with the specified endpoint number.

unsigned int transfer_flags
unsigned int transfer_flags

该变量可以设置为许多不同的位值,具体取决于 USB 驱动程序希望 urb 发生什么情况。可用值为:

This variable can be set to a number of different bit values, depending on what the USB driver wants to happen to the urb. The available values are:

URB_SHORT_NOT_OK
URB_SHORT_NOT_OK

设置后,它指定 USB 核心应将 IN 端点上可能发生的任何短读取视为错误。该值仅适用于要从 USB 设备读取的 urb,不适用于写入 urb。

When set, it specifies that any short read on an IN endpoint that might occur should be treated as an error by the USB core. This value is useful only for urbs that are to be read from the USB device, not for write urbs.

URB_ISO_ASAP
URB_ISO_ASAP

如果 urb 是同步的,如果驱动程序希望在带宽利用率允许的情况下调度 urb,则可以设置该位,并在此时设置 start_frameurb 中的变量。如果未为同步 urb 设置该位,则驱动程序必须指定该start_frame值,并且如果此时传输无法启动,则必须能够正确恢复。有关更多信息,请参阅即将到来的有关等时 urb 的部分。

If the urb is isochronous, this bit can be set if the driver wants the urb to be scheduled, as soon as the bandwidth utilization allows it to be, and to set the start_frame variable in the urb at that point. If this bit is not set for an isochronous urb, the driver must specify the start_frame value and must be able to recover properly if the transfer cannot start at that moment. See the upcoming section about isochronous urbs for more information.

URB_NO_TRANSFER_DMA_MAP
URB_NO_TRANSFER_DMA_MAP

当 urb 包含要传输的 DMA 缓冲区时应设置。USB 核心使用 transfer_dma 变量指向的缓冲区,而不是 transfer_buffer 变量指向的缓冲区。

Should be set when the urb contains a DMA buffer to be transferred. The USB core uses the buffer pointed to by the transfer_dma variable and not the buffer pointed to by the transfer_buffer variable.

URB_NO_SETUP_DMA_MAP
URB_NO_SETUP_DMA_MAP

与 URB_NO_TRANSFER_DMA_MAP 位一样,该位用于已设置 DMA 缓冲区的控制 urb。如果设置了该位,USB 核心使用 setup_dma 变量指向的缓冲区,而不是 setup_packet 变量。

Like the URB_NO_TRANSFER_DMA_MAP bit, this bit is used for control urbs that have a DMA buffer already set up. If it is set, the USB core uses the buffer pointed to by the setup_dma variable instead of the setup_packet variable.

URB_ASYNC_UNLINK
URB_ASYNC_UNLINK

如果设置,则对该 urb 的usb_unlink_urb调用几乎立即返回,并且该 urb 在后台取消链接。否则,该函数将等到 urb 完全取消链接并完成后再返回。请小心使用该位,因为它会使同步问题很难调试。

If set, the call to usb_unlink_urb for this urb returns almost immediately, and the urb is unlinked in the background. Otherwise, the function waits until the urb is completely unlinked and finished before returning. Use this bit with care, because it can make synchronization issues very difficult to debug.

URB_NO_FSBR
URB_NO_FSBR

仅由 UHCI USB 主机控制器驱动程序使用,并告诉它不要尝试执行前端总线回收逻辑。通常不应设置该位,因为具有 UHCI 主机控制器的机器会产生大量 CPU 开销,并且 PCI 总线在等待设置该位的 urb 时处于饱和状态。

Used by only the UHCI USB Host controller driver and tells it to not try to do Front Side Bus Reclamation logic. This bit should generally not be set, because machines with a UHCI host controller create a lot of CPU overhead, and the PCI bus is saturated waiting on a urb that sets this bit.

URB_ZERO_PACKET
URB_ZERO_PACKET

如果设置,当数据与端点数据包边界对齐时,批量输出 urb 将通过发送不包含数据的短数据包来完成。一些损坏的 USB 设备(例如许多 USB 转 IR 设备)需要这样做才能正常工作。

If set, a bulk out urb finishes by sending a short packet containing no data when the data is aligned to an endpoint packet boundary. This is needed by some broken USB devices (such as a number of USB to IR devices) in order to work properly.

URB_NO_INTERRUPT
URB_NO_INTERRUPT

如果设置,当 urb 完成时,硬件可能不会生成中断。应谨慎使用该位,并且仅在将多个 urb 排队到同一端点时使用。USB 核心功能使用它来进行 DMA 缓冲区传输。

If set, the hardware may not generate an interrupt when the urb is finished. This bit should be used with care and only when queuing multiple urbs to the same endpoint. The USB core functions use this in order to do DMA buffer transfers.

void *transfer_buffer
void *transfer_buffer

指向向设备发送数据(对于 OUT urb)或从设备接收数据(对于 IN urb)时使用的缓冲区的指针。为了使主机控制器正确访问此缓冲区,必须通过调用来创建它 kmalloc,而不是在堆栈上或静态地创建它。对于控制端点,该缓冲区用于传输的数据阶段。

Pointer to the buffer to be used when sending data to the device (for an OUT urb) or when receiving data from the device (for an IN urb). In order for the host controller to properly access this buffer, it must be created with a call to kmalloc, not on the stack or statically. For control endpoints, this buffer is for the data stage of the transfer.

dma_addr_t transfer_dma
dma_addr_t transfer_dma

用于使用 DMA 将数据传输到 USB 设备的缓冲区。

Buffer to be used to transfer data to the USB device using DMA.

int transfer_buffer_length
int transfer_buffer_length

transfer_buffer 或 transfer_dma 变量指向的缓冲区的长度(一个 urb 只能使用其中之一)。如果为 0,则 USB 核心不使用任何一个传输缓冲区。

The length of the buffer pointed to by the transfer_buffer or the transfer_dma variable (as only one can be used for a urb). If this is 0, neither transfer buffer is used by the USB core.

对于 OUT 端点,如果端点最大大小小于此变量中指定的值,则到 USB 设备的传输将被分成更小的块,以便正确传输数据。这种大的传输发生在连续的 USB 帧中。在一个 urb 中提交一大块数据,并让 USB 主控制器将其分割成更小的数据块,比按连续顺序发送较小的缓冲区要快得多。

For an OUT endpoint, if the endpoint maximum size is smaller than the value specified in this variable, the transfer to the USB device is broken up into smaller chunks in order to properly transfer the data. This large transfer occurs in consecutive USB frames. It is much faster to submit a large block of data in one urb, and have the USB host controller split it up into smaller pieces, than it is to send smaller buffers in consecutive order.

unsigned char *setup_packet
unsigned char *setup_packet

指向控制 urb 的设置数据包的指针。它在传输缓冲区中的数据之前传输。该变量仅对控制 urbs 有效。

Pointer to the setup packet for a control urb. It is transferred before the data in the transfer buffer. This variable is valid only for control urbs.

dma_addr_t setup_dma
dma_addr_t setup_dma

用于控制 urb 的设置数据包的 DMA 缓冲区。它在正常传输缓冲区中的数据之前传输。该变量仅对控制 urbs 有效。

DMA buffer for the setup packet for a control urb. It is transferred before the data in the normal transfer buffer. This variable is valid only for control urbs.

usb_complete_t complete
usb_complete_t complete

指向完成处理函数的指针,当 urb 完全传输或 urb 发生错误时,由 USB 内核调用该函数。在此函数中,USB 驱动程序可以检查 urb、释放它或重新提交它以进行另一次传输。( 有关完成处理程序的更多详细信息,请参阅第 13.3.4 节。)

Pointer to the completion handler function that is called by the USB core when the urb is completely transferred or when an error occurs to the urb. Within this function, the USB driver may inspect the urb, free it, or resubmit it for another transfer. (See the Section 13.3.4 for more details about the completion handler.)

usb_complete_t typedef 定义为:

The usb_complete_t typedef is defined as:

typedef void (*usb_complete_t)(struct urb *, struct pt_regs *);
typedef void (*usb_complete_t)(struct urb *, struct pt_regs *);
void *context
void *context

指向可由 USB 驱动程序设置的数据 blob 的指针。当 urb 返回给驱动程序时,它可以在完成处理程序中使用。有关此变量的更多详细信息,请参阅以下部分。

Pointer to a data blob that can be set by the USB driver. It can be used in the completion handler when the urb is returned to the driver. See the following section for more details about this variable.

int actual_length
int actual_length

当 urb 完成时,该变量被设置为 urb 发送的数据的实际长度(对于 OUT urbs)或 urb 接收的数据(对于 IN urbs)。对于 IN urbs,必须使用它来代替该transfer_buffer_length变量,因为接收到的数据可能小于整个缓冲区大小。

When the urb is finished, this variable is set to the actual length of the data either sent by the urb (for OUT urbs) or received by the urb (for IN urbs.) For IN urbs, this must be used instead of the transfer_buffer_length variable, because the data received could be smaller than the whole buffer size.

int status
int status

当 urb 完成或正在被 USB 核心处理时,该变量被设置为 urb 的当前状态。USB 驱动程序唯一可以安全访问此变量的时机是在 urb 完成处理函数中(如第 13.3.4 节所述)。此限制是为了防止在 USB 核心处理 urb 时发生竞争条件。对于同步 urb,此变量中的成功值(0)仅指示该 urb 是否已被取消链接。要获得同步 urb 的详细状态,应检查 iso_frame_desc 变量。

When the urb is finished, or being processed by the USB core, this variable is set to the current status of the urb. The only time a USB driver can safely access this variable is in the urb completion handler function (described in Section 13.3.4). This restriction is to prevent race conditions that occur while the urb is being processed by the USB core. For isochronous urbs, a successful value (0) in this variable merely indicates whether the urb has been unlinked. To obtain a detailed status on isochronous urbs, the iso_frame_desc variables should be checked.

该变量的有效值包括:

Valid values for this variable include:

0
0

urb转移成功。

The urb transfer was successful.

-ENOENT
-ENOENT

urb 通过调用usb_kill_urb停止。

The urb was stopped by a call to usb_kill_urb.

-ECONNRESET
-ECONNRESET

通过调用 usb_unlink_urb 取消了该 urb 的链接,并且该 urb 的 transfer_flags 变量设置了 URB_ASYNC_UNLINK。

The urb was unlinked by a call to usb_unlink_urb, and the transfer_flags variable of the urb was set to URB_ASYNC_UNLINK.

-EINPROGRESS
-EINPROGRESS

urb 仍在由 USB 主控制器处理。如果您的驱动程序看到此值,则表明您的驱动程序中存在错误。

The urb is still being processed by the USB host controllers. If your driver ever sees this value, it is a bug in your driver.

-EPROTO
-EPROTO

此 urb 发生以下错误之一:

  • 传输过程中发生位错误。

  • 硬件没有及时收到响应包。

One of the following errors occurred with this urb:

  • A bitstuff error happened during the transfer.

  • No response packet was received in time by the hardware.

-EILSEQ
-EILSEQ

urb 传输中存在 CRC 不匹配。

There was a CRC mismatch in the urb transfer.

-EPIPE
-EPIPE

端点现在已停止。如果涉及的端点不是控制端点,则可以通过调用函数 usb_clear_halt来清除此错误。

The endpoint is now stalled. If the endpoint involved is not a control endpoint, this error can be cleared through a call to the function usb_clear_halt.

-ECOMM
-ECOMM

传输过程中接收数据的速度比写入系统内存的速度快。此错误值仅发生在 IN urb 上。

Data was received faster during the transfer than it could be written to system memory. This error value happens only for an IN urb.

-ENOSR
-ENOSR

在传输过程中,无法从系统内存中检索数据,速度不够快,无法跟上所请求的 USB 数据速率。此错误值仅发生在 OUT urb 上。

Data could not be retrieved from the system memory during the transfer fast enough to keep up with the requested USB data rate. This error value happens only for an OUT urb.

-EOVERFLOW
-EOVERFLOW

urb 发生了“babble”错误。当端点接收到的数据多于端点指定的最大数据包大小时,就会发生“babble”错误。

A "babble" error happened to the urb. A "babble" error occurs when the endpoint receives more data than the endpoint's specified maximum packet size.

-EREMOTEIO
-EREMOTEIO

仅当URB_SHORT_NOT_OK在 urb 的transfer_flags变量中设置了该标志时才会发生,这意味着未收到 urb 请求的全部数据量。

Occurs only if the URB_SHORT_NOT_OK flag is set in the urb's transfer_flags variable and means that the full amount of data requested by the urb was not received.

-ENODEV
-ENODEV

USB 设备现已从系统中消失。

The USB device is now gone from the system.

-EXDEV
-EXDEV

仅针对同步 urb 发生,意味着传输仅部分完成。为了确定传输的内容,驱动程序必须查看各个帧的状态。

Occurs only for a isochronous urb and means that the transfer was only partially completed. In order to determine what was transferred, the driver must look at the individual frame status.

-EINVAL
-EINVAL

urb 发生了一些非常糟糕的事情。USB 内核文档描述了该值的含义:

ISO 疯狂,如果发生这种情况:注销并回家

如果 urb 结构中的某个参数设置不正确,或者在 usb_submit_urb 调用中用不正确的函数参数将 urb 提交给 USB 核心,也可能发生这种情况。

Something very bad happened with the urb. The USB kernel documentation describes what this value means:

ISO madness, if this happens: Log off and go home

It also can happen if a parameter is incorrectly set in the urb structure or if an incorrect function parameter in the usb_submit_urb call submitted the urb to the USB core.

-ESHUTDOWN
-ESHUTDOWN

USB 主控制器驱动程序存在严重错误;现在已被禁用,或者设备已与系统断开连接,并且 urb 是在设备被移除后提交的。如果在将 urb 提交给设备时更改设备的配置,也可能会发生这种情况。

通常,错误值-EPROTO-EILSEQ-EOVERFLOW表示设备、设备固件或将设备连接到计算机的电缆的硬件问题。

There was a severe error with the USB host controller driver; it has now been disabled, or the device was disconnected from the system, and the urb was submitted after the device was removed. It can also occur if the configuration was changed for the device, while the urb was submitted to the device.

Generally, the error values -EPROTO, -EILSEQ, and -EOVERFLOW indicate hardware problems with the device, the device firmware, or the cable connecting the device to the computer.

int start_frame
int start_frame

设置或返回要使用的同步传输的初始帧号。

Sets or returns the initial frame number for isochronous transfers to use.

int interval
int interval

urb 被轮询的时间间隔。这仅对中断或同步 urb 有效。该值的单位根据设备的速度而不同。对于低速和全速设备,单位是帧,相当于毫秒。对于高速设备,单位是微帧,相当于 1/8 毫秒。在将 urb 发送到 USB 核心之前,必须由 USB 驱动程序为同步或中断 urb 设置该值。

The interval at which the urb is polled. This is valid only for interrupt or isochronous urbs. The value's units differ depending on the speed of the device. For low-speed and full-speed devices, the units are frames, which are equivalent to milliseconds. For high-speed devices, the units are microframes, which are equivalent to units of 1/8 millisecond. This value must be set by the USB driver for isochronous or interrupt urbs before the urb is sent to the USB core.

int number_of_packets
int number_of_packets

仅对等时 urb 有效,并指定此 urb 要处理的等时传输缓冲区的数量。在将 urb 发送到 USB 核心之前,必须由 USB 驱动程序为同步 urb 设置该值。

Valid only for isochronous urbs and specifies the number of isochronous transfer buffers to be handled by this urb. This value must be set by the USB driver for isochronous urbs before the urb is sent to the USB core.

int error_count
int error_count

仅在同步 urb 完成后由 USB 内核设置。它指定报告任何类型错误的同步传输的数量。

Set by the USB core only for isochronous urbs after their completion. It specifies the number of isochronous transfers that reported any type of error.

struct usb_iso_packet_descriptor iso_frame_desc[0]
struct usb_iso_packet_descriptor iso_frame_desc[0]

仅对同步 urb 有效。该变量是组成该 urb 的 struct usb_iso_packet_descriptor 结构数组。此结构允许单个 urb 一次定义多个同步传输。它还用于收集每次单独传输的传输状态。

Valid only for isochronous urbs. This variable is an array of the struct usb_iso_packet_descriptor structures that make up this urb. This structure allows a single urb to define a number of isochronous transfers at once. It is also used to collect the transfer status of each individual transfer.

struct usb_iso_packet_descriptor由以下字段组成:

The struct usb_iso_packet_descriptor is made up of the following fields:

unsigned int offset
unsigned int offset

该数据包数据所在位置在传输缓冲区中的偏移量(第一个字节从 0 开始)。

The offset into the transfer buffer (starting at 0 for the first byte) where this packet's data is located.

unsigned int length
unsigned int length

该数据包的传输缓冲区的长度。

The length of the transfer buffer for this packet.

unsigned int actual_length
unsigned int actual_length

为该等时数据包接收到传输缓冲区中的数据的长度。

The length of the data received into the transfer buffer for this isochronous packet.

unsigned int status
unsigned int status

该数据包的单次等时传输的状态。它可以采用与主 struct urb 结构的 status 变量相同的返回值。

The status of the individual isochronous transfer of this packet. It can take the same return values as the main struct urb structure's status variable.

创建和销毁 urb

Creating and Destroying Urbs

struct urb 结构绝不能在驱动程序中或在另一个结构内静态创建,因为这会破坏 USB 核心对 urb 使用的引用计数方案。它必须通过调用 usb_alloc_urb 函数来创建。该函数的原型为:

The struct urb structure must never be created statically in a driver or within another structure, because that would break the reference counting scheme used by the USB core for urbs. It must be created with a call to the usb_alloc_urb function. This function has the prototype:

struct urb *usb_alloc_urb(int iso_packets, int mem_flags);
struct urb *usb_alloc_urb(int iso_packets, int mem_flags);

第一个参数 iso_packets 是该 urb 应包含的等时数据包的数量。如果您不想创建等时 urb,则应将此参数设置为 0。第二个参数 mem_flags 与传递给 kmalloc 函数调用以从内核分配内存的标志类型相同(有关这些标志的详细信息,请参阅第 8.1.1 节)。如果函数成功为 urb 分配了足够的空间,则会向调用者返回指向该 urb 的指针。如果返回值为 NULL,则说明 USB 核心内部发生了错误,驱动程序需要正确清理。

The first parameter, iso_packets, is the number of isochronous packets this urb should contain. If you do not want to create an isochronous urb, this variable should be set to 0. The second parameter, mem_flags, is the same type of flag that is passed to the kmalloc function call to allocate memory from the kernel (see Section 8.1.1 for the details on these flags). If the function is successful in allocating enough space for the urb, a pointer to the urb is returned to the caller. If the return value is NULL, some error occurred within the USB core, and the driver needs to clean up properly.

创建 urb 后,必须对其进行正确初始化,然后才能被 USB 核心使用。有关如何初始化不同类型的 urb,请参阅接下来的几节。

After a urb has been created, it must be properly initialized before it can be used by the USB core. See the next sections for how to initialize different types of urbs.

为了告诉 USB 核心驱动程序已完成 urb,驱动程序必须调用usb_free_urb函数。该函数只有一个参数:

In order to tell the USB core that the driver is finished with the urb, the driver must call the usb_free_urb function. This function only has one argument:

void usb_free_urb(struct urb *urb);

该参数是指向您要释放的 struct urb 的指针。调用该函数后,urb 结构就消失了,驱动程序无法再访问它。

The argument is a pointer to the struct urb you want to release. After this function is called, the urb structure is gone, and the driver cannot access it any more.

中断 urb

Interrupt urbs

usb_fill_int_urb 函数是一个辅助函数,用于正确初始化要发送到 USB 设备中断端点的 urb:

The function usb_fill_int_urb is a helper function to properly initialize a urb to be sent to an interrupt endpoint of a USB device:

void usb_fill_int_urb(struct urb *urb, struct usb_device *dev,
                      unsigned int pipe, void *transfer_buffer,
                      int buffer_length, usb_complete_t complete,
                      void *context, int interval);

这个函数包含很多参数:

This function contains a lot of parameters:

struct urb *urb
struct urb *urb

指向要初始化的 urb 的指针。

A pointer to the urb to be initialized.

struct usb_device *dev
struct usb_device *dev

该 urb 要发送到的 USB 设备。

The USB device to which this urb is to be sent.

unsigned int pipe
unsigned int pipe

该 urb 要发送到的 USB 设备的特定端点。该值是使用前面提到的usb_sndintpipeusb_rcvintpipe函数创建的。

The specific endpoint of the USB device to which this urb is to be sent. This value is created with the previously mentioned usb_sndintpipe or usb_rcvintpipe functions.

void *transfer_buffer
void *transfer_buffer

指向缓冲区的指针,从中取出传出数据或接收传入数据。请注意,这不能是静态缓冲区,必须通过调用kmalloc创建。

A pointer to the buffer from which outgoing data is taken or into which incoming data is received. Note that this can not be a static buffer and must be created with a call to kmalloc.

int buffer_length
int buffer_length

transfer_buffer 指针所指向的缓冲区的长度。

The length of the buffer pointed to by the transfer_buffer pointer.

usb_complete_t complete
usb_complete_t complete

指向此 urb 完成时调用的完成处理程序的指针。

Pointer to the completion handler that is called when this urb is completed.

void *context
void *context

指向添加到 urb 结构中的 blob 的指针,以便稍后由完成处理程序函数检索。

Pointer to the blob that is added to the urb structure for later retrieval by the completion handler function.

int interval
int interval

该 urb 应被调度的时间间隔。请参阅前面对 struct urb 结构的描述,以确定该值的正确单位。

The interval at which that this urb should be scheduled. See the previous description of the struct urb structure to find the proper units for this value.

批量 urb

Bulk urbs

批量 urb 的初始化与中断 urb 非常相似。执行此操作的函数是 usb_fill_bulk_urb,它看起来像:

Bulk urbs are initialized much like interrupt urbs. The function that does this is usb_fill_bulk_urb, and it looks like:

void usb_fill_bulk_urb(struct urb *urb, struct usb_device *dev,
                       unsigned int pipe, void *transfer_buffer,
                       int buffer_length, usb_complete_t complete,
                       void *context);

函数参数与 usb_fill_int_urb 函数中的参数相同。但是,没有 interval 参数,因为批量 urb 没有间隔值。请注意,unsigned int pipe 变量必须通过调用 usb_sndbulkpipe 或 usb_rcvbulkpipe 函数来初始化。

The function parameters are all the same as in the usb_fill_int_urb function. However, there is no interval parameter because bulk urbs have no interval value. Please note that the unsigned int pipe variable must be initialized with a call to the usb_sndbulkpipe or usb_rcvbulkpipe function.

usb_fill_bulk_urb 函数不会设置 urb 中的 transfer_flags 变量,因此对该字段的任何修改都必须由驱动程序本身完成。

The usb_fill_bulk_urb function does not set the transfer_flags variable in the urb, so any modification to this field has to be done by the driver itself.

控制 urb

Control urbs

控制 urb 的初始化方式与批量 urb 几乎相同,通过调用 usb_fill_control_urb 函数:

Control urbs are initialized almost the same way as bulk urbs, with a call to the function usb_fill_control_urb:

void usb_fill_control_urb(struct urb *urb, struct usb_device *dev,
                          unsigned int pipe, unsigned char *setup_packet,
                          void *transfer_buffer, int buffer_length,
                          usb_complete_t complete, void *context);

函数参数与 usb_fill_bulk_urb 函数中的参数完全相同,只是增加了一个新参数 unsigned char *setup_packet,该参数必须指向要发送到端点的设置数据包数据。此外,unsigned int pipe 变量必须通过调用 usb_sndctrlpipe 或 usb_rcvctrlpipe 函数来初始化。

The function parameters are all the same as in the usb_fill_bulk_urb function, except that there is a new parameter, unsigned char *setup_packet, which must point to the setup packet data that is to be sent to the endpoint. Also, the unsigned int pipe variable must be initialized with a call to the usb_sndctrlpipe or usb_rcvctrlpipe function.

usb_fill_control_urb 函数不会设置 urb 中的 transfer_flags 变量,因此对该字段的任何修改都必须由驱动程序本身完成。大多数驱动程序不使用此函数,因为使用第 13.5 节中描述的同步 API 调用要简单得多。

The usb_fill_control_urb function does not set the transfer_flags variable in the urb, so any modification to this field has to be done by the driver itself. Most drivers do not use this function, as it is much simpler to use the synchronous API calls as described in Section 13.5.

等时 urb

Isochronous urbs

遗憾的是,等时 urb 没有像中断、控制和批量 urb 那样的初始化函数。因此,在将它们提交给 USB 核心之前,必须在驱动程序中"手动"初始化。以下是如何正确初始化此类 urb 的示例,取自主内核源代码树 drivers/usb/media 目录中的 konicawc.c 内核驱动程序。

Isochronous urbs unfortunately do not have an initializer function like the interrupt, control, and bulk urbs do. So they must be initialized "by hand" in the driver before they can be submitted to the USB core. The following is an example of how to properly initialize this type of urb. It was taken from the konicawc.c kernel driver located in the drivers/usb/media directory in the main kernel source tree.

urb->dev = dev;
urb->context = uvd;
urb->pipe = usb_rcvisocpipe(dev, uvd->video_endp-1);
urb->interval = 1;
urb->transfer_flags = URB_ISO_ASAP;
urb->transfer_buffer = cam->sts_buf[i];
urb->complete = konicawc_isoc_irq;
urb->number_of_packets = FRAMES_PER_DESC;
urb->transfer_buffer_length = FRAMES_PER_DESC;
for (j=0; j < FRAMES_PER_DESC; j++) {
        urb->iso_frame_desc[j].offset = j;
        urb->iso_frame_desc[j].length = 1;
}

提交 urb

Submitting Urbs

一旦 urb 被 USB 驱动程序正确创建和初始化,它就可以提交到 USB 核心,以发送到 USB 设备。这是通过调用 usb_submit_urb 函数来完成的:

Once the urb has been properly created and initialized by the USB driver, it is ready to be submitted to the USB core to be sent out to the USB device. This is done with a call to the function usb_submit_urb:

int usb_submit_urb(struct urb *urb, int mem_flags);
int usb_submit_urb(struct urb *urb, int mem_flags);

urb 参数是指向要发送到设备的 urb 的指针。mem_flags 参数与传递给 kmalloc 调用的同名参数等价,用于告诉 USB 核心此时如何分配内存缓冲区。

The urb parameter is a pointer to the urb that is to be sent to the device. The mem_flags parameter is equivalent to the same parameter that is passed to the kmalloc call and is used to tell the USB core how to allocate any memory buffers at this moment in time.

urb 成功提交给 USB 核心后,在 complete 函数被调用之前,驱动程序不应尝试访问 urb 结构的任何字段。

After a urb has been successfully submitted to the USB core, the driver should never try to access any fields of the urb structure until the complete function is called.

由于 usb_submit_urb 函数可以随时调用(包括在中断上下文中),因此 mem_flags 变量的指定必须正确。实际上只应使用三个值,具体取决于调用 usb_submit_urb 的时机:

Because the function usb_submit_urb can be called at any time (including from within an interrupt context), the specification of the mem_flags variable must be correct. There are really only three valid values that should be used, depending on when usb_submit_urb is being called:

GFP_ATOMIC
GFP_ATOMIC

只要满足以下条件,就应使用该值:

  • 调用者位于 urb 完成处理程序、中断、下半部、tasklet 或定时器回调中。

  • 调用者持有自旋锁或读写锁。请注意,如果持有的是信号量,则不需要该值。

  • current->state 不是 TASK_RUNNING。除非驱动程序自己改变了 current 的状态,否则该状态始终是 TASK_RUNNING。

This value should be used whenever the following are true:

  • The caller is within a urb completion handler, an interrupt, a bottom half, a tasklet, or a timer callback.

  • The caller is holding a spinlock or rwlock. Note that if a semaphore is being held, this value is not necessary.

  • The current->state is not TASK_RUNNING. The state is always TASK_RUNNING unless the driver has changed the current state itself.

GFP_NOIO
GFP_NOIO

如果驱动程序处于块 I/O 路径中,则应使用该值。它也应该用在所有存储类型设备的错误处理路径中。

This value should be used if the driver is in the block I/O path. It should also be used in the error handling path of all storage-type devices.

GFP_KERNEL
GFP_KERNEL

这应该用于不属于前面提到的类别之一的所有其他情况。

This should be used for all other situations that do not fall into one of the previously mentioned categories.

完成 Urbs:完成回调处理程序

Completing Urbs: The Completion Callback Handler

如果对 usb_submit_urb 的调用成功,urb 的控制权就转移给了 USB 核心,该函数返回 0;否则,返回负的错误号。如果函数成功,当 urb 完成时,urb 的完成处理程序(由 complete 函数指针指定)将被恰好调用一次。当该函数被调用时,USB 核心就完成了对该 urb 的处理,对它的控制权此时返回给设备驱动程序。

If the call to usb_submit_urb was successful, transferring control of the urb to the USB core, the function returns 0; otherwise, a negative error number is returned. If the function succeeds, the completion handler of the urb (as specified by the complete function pointer) is called exactly once when the urb is completed. When this function is called, the USB core is finished with the urb, and control of it is now returned to the device driver.

urb 只有三种方式可以完成并使 complete 函数被调用:

There are only three ways a urb can be finished and have the complete function called:

  • urb 已成功发送到设备,并且设备返回了正确的确认。对于 OUT urb,数据已成功发送;对于 IN urb,已成功接收到请求的数据。如果发生这种情况,urb 中的 status 变量将被设置为 0。

  • The urb is successfully sent to the device, and the device returns the proper acknowledgment. For an OUT urb, the data was successfully sent, and for an IN urb, the requested data was successfully received. If this has happened, the status variable in the urb is set to 0.

  • 从设备发送或接收数据时发生了某种错误。这由 urb 结构中 status 变量的错误值指出。

  • Some kind of error happened when sending or receiving data from the device. This is noted by the error value in the status variable in the urb structure.

  • urb 与 USB 核心"取消链接"。当驱动程序通过调用 usb_unlink_urb 或 usb_kill_urb 告诉 USB 核心取消已提交的 urb 时,或者当设备已从系统中移除而 urb 已提交给它时,会发生这种情况。

  • The urb was "unlinked" from the USB core. This happens either when the driver tells the USB core to cancel a submitted urb with a call to usb_unlink_urb or usb_kill_urb, or when a device is removed from the system and a urb had been submitted to it.

本章后面将展示如何测试 urb 完成调用中不同返回值的示例。

An example of how to test for the different return values within a urb completion call is shown later in this chapter.

取消 urb

Canceling Urbs

要停止已提交给 USB 核心的 urb,应调用 usb_kill_urb 或 usb_unlink_urb 函数:

To stop a urb that has been submitted to the USB core, the functions usb_kill_urb or usb_unlink_urb should be called:

int usb_kill_urb(struct urb *urb);
int usb_unlink_urb(struct urb *urb);
int usb_kill_urb(struct urb *urb);
int usb_unlink_urb(struct urb *urb);

这两个函数的参数urb都是指向要取消的 urb 的指针。

The urb parameter for both of these functions is a pointer to the urb that is to be canceled.

当调用 usb_kill_urb 时,urb 的生命周期即告终止。该函数通常在设备与系统断开连接时,在断开连接回调中使用。

When the function is usb_kill_urb, the urb lifecycle is stopped. This function is usually used when the device is disconnected from the system, in the disconnect callback.

对于某些驱动程序,应使用 usb_unlink_urb 函数来告诉 USB 核心停止 urb。该函数在返回调用者之前不会等待 urb 完全停止。这对于在中断处理程序中或持有自旋锁时停止 urb 非常有用,因为等待 urb 完全停止要求 USB 核心能够让调用进程进入睡眠。要使该函数正常工作,被要求停止的 urb 中必须设置 URB_ASYNC_UNLINK 标志值。

For some drivers, the usb_unlink_urb function should be used to tell the USB core to stop an urb. This function does not wait for the urb to be fully stopped before returning to the caller. This is useful for stopping the urb while in an interrupt handler or when a spinlock is held, as waiting for a urb to fully stop requires the ability for the USB core to put the calling process to sleep. This function requires that the URB_ASYNC_UNLINK flag value be set in the urb that is being asked to be stopped in order to work properly.

编写 USB 驱动程序

Writing a USB Driver

编写 USB 设备驱动程序的方法类似于 pci_driver:驱动程序向 USB 子系统注册其驱动程序对象,然后使用供应商和设备标识符来判断其硬件是否已安装。

The approach to writing a USB device driver is similar to a pci_driver: the driver registers its driver object with the USB subsystem and later uses vendor and device identifiers to tell if its hardware has been installed.

驱动程序支持哪些设备?

What Devices Does the Driver Support?

struct usb_device_id 结构提供了该驱动程序支持的不同类型 USB 设备的列表。USB 核心使用此列表来决定将设备交给哪个驱动程序,热插拔脚本则用它来决定当特定设备插入系统时自动加载哪个驱动程序。

The struct usb_device_id structure provides a list of different types of USB devices that this driver supports. This list is used by the USB core to decide which driver to give a device to, and by the hotplug scripts to decide which driver to automatically load when a specific device is plugged into the system.

struct usb_device_id 结构体由以下字段定义:

The struct usb_device_id structure is defined with the following fields:

_ _u16 match_flags
_ _u16 match_flags

确定设备应与结构中的以下哪些字段进行匹配。这是一个位字段,由 include/linux/mod_devicetable.h 文件中指定的不同 USB_DEVICE_ID_MATCH_* 值定义。该字段通常不会直接设置,而是由稍后描述的 USB_DEVICE 类型宏初始化。

Determines which of the following fields in the structure the device should be matched against. This is a bit field defined by the different USB_DEVICE_ID_MATCH_* values specified in the include/linux/mod_devicetable.h file. This field is usually never set directly but is initialized by the USB_DEVICE type macros described later.

_ _u16 idVendor
_ _u16 idVendor

设备的 USB 供应商 ID。该编号由 USB 论坛分配给其成员,其他任何人都不能随意编造。

The USB vendor ID for the device. This number is assigned by the USB forum to its members and cannot be made up by anyone else.

_ _u16 idProduct
_ _u16 idProduct

设备的 USB 产品 ID。所有分配了供应商 ID 的供应商都可以按照自己的选择管理其产品 ID。

The USB product ID for the device. All vendors that have a vendor ID assigned to them can manage their product IDs however they choose to.

_ _u16 bcdDevice_lo

_ _u16 bcdDevice_hi
_ _u16 bcdDevice_lo

_ _u16 bcdDevice_hi

定义供应商分配的产品版本号范围的低端和高端。bcdDevice_hi 的值是包含在内的;它是编号最高的设备的版本号。这两个值都以二进制编码十进制(BCD)形式表示。这些变量与 idVendor 和 idProduct 结合使用,用于定义设备的特定版本。

Define the low and high ends of the range of the vendor-assigned product version number. The bcdDevice_hi value is inclusive; its value is the number of the highest-numbered device. Both of these values are expressed in binary-coded decimal (BCD) form. These variables, combined with the idVendor and idProduct, are used to define a specific version of a device.

_ _u8 bDeviceClass

_ _u8 bDeviceSubClass

_ _u8 bDeviceProtocol
_ _u8 bDeviceClass

_ _u8 bDeviceSubClass

_ _u8 bDeviceProtocol

分别定义设备的类、子类和协议。这些编号由 USB 论坛分配,并在 USB 规范中定义。这些值指定整个设备的行为,包括该设备上的所有接口。

Define the class, subclass, and protocol of the device, respectively. These numbers are assigned by the USB forum and are defined in the USB specification. These values specify the behavior for the whole device, including all interfaces on this device.

_ _u8 bInterfaceClass

_ _u8 bInterfaceSubClass

_ _u8 bInterfaceProtocol
_ _u8 bInterfaceClass

_ _u8 bInterfaceSubClass

_ _u8 bInterfaceProtocol

与上面的设备特定值非常相似,它们分别定义各个接口的类、子类和协议。这些编号由 USB 论坛分配,并在 USB 规范中定义。

Much like the device-specific values above, these define the class, subclass, and protocol of the individual interface, respectively. These numbers are assigned by the USB forum and are defined in the USB specification.

kernel_ulong_t driver_info
kernel_ulong_t driver_info

该值不用于匹配,但它包含驱动程序可以在 USB 驱动程序的 probe 回调函数中用来区分不同设备的信息。

This value is not used to match against, but it holds information that the driver can use to differentiate the different devices from each other in the probe callback function to the USB driver.

与 PCI 设备一样,有许多宏用于初始化该结构:

As with PCI devices, there are a number of macros that are used to initialize this structure:

USB_DEVICE(vendor, product)
USB_DEVICE(vendor, product)

创建一个只匹配指定供应商和产品 ID 值的 struct usb_device_id。这对于需要特定驱动程序的 USB 设备非常常用。

Creates a struct usb_device_id that can be used to match only the specified vendor and product ID values. This is very commonly used for USB devices that need a specific driver.

USB_DEVICE_VER(vendor, product, lo, hi)
USB_DEVICE_VER(vendor, product, lo, hi)

创建一个只匹配指定供应商和产品 ID 值且处于某个版本范围内的 struct usb_device_id。

Creates a struct usb_device_id that can be used to match only the specified vendor and product ID values within a version range.

USB_DEVICE_INFO(class, subclass, protocol)
USB_DEVICE_INFO(class, subclass, protocol)

创建一个可用于匹配特定类别 USB 设备的 struct usb_device_id。

Creates a struct usb_device_id that can be used to match a specific class of USB devices.

USB_INTERFACE_INFO(class, subclass, protocol)
USB_INTERFACE_INFO(class, subclass, protocol)

创建一个可用于匹配特定类别 USB 接口的 struct usb_device_id。

Creates a struct usb_device_id that can be used to match a specific class of USB interfaces.

因此,对于仅控制来自单个供应商的单个 USB 设备的简单 USB 设备驱动程序,该struct usb_device_id表将定义为:

So, for a simple USB device driver that controls only a single USB device from a single vendor, the struct usb_device_id table would be defined as:

/* table of devices that work with this driver */
static struct usb_device_id skel_table [  ] = {
    { USB_DEVICE(USB_SKEL_VENDOR_ID, USB_SKEL_PRODUCT_ID) },
    { }                 /* Terminating entry */
};
MODULE_DEVICE_TABLE (usb, skel_table);

与 PCI 驱动程序一样,MODULE_DEVICE_TABLE 宏对于允许用户空间工具确定该驱动程序可以控制哪些设备是必需的。但对于 USB 驱动程序,字符串 usb 必须是宏中的第一个值。

As with a PCI driver, the MODULE_DEVICE_TABLE macro is necessary to allow user-space tools to figure out what devices this driver can control. But for USB drivers, the string usb must be the first value in the macro.

注册 USB 驱动程序

Registering a USB Driver

所有 USB 驱动程序都必须创建的主要结构是 struct usb_driver。该结构必须由 USB 驱动程序填写,包含许多向 USB 核心代码描述该 USB 驱动程序的函数回调和变量:

The main structure that all USB drivers must create is a struct usb_driver. This structure must be filled out by the USB driver and consists of a number of function callbacks and variables that describe the USB driver to the USB core code:

struct module *owner
struct module *owner

指向该驱动程序的模块所有者的指针。USB 核心使用它来正确地对该 USB 驱动程序进行引用计数,以便它不会在不合时宜的时刻被卸载。该变量应设置为 THIS_MODULE 宏。

Pointer to the module owner of this driver. The USB core uses it to properly reference count this USB driver so that it is not unloaded at inopportune moments. The variable should be set to the THIS_MODULE macro.

const char *name
const char *name

指向驱动程序名称的指针。它在内核中的所有 USB 驱动程序中必须是唯一的,并且通常设置为与驱动程序的模块名称相同的名称。当驱动程序位于内核中时,它会显示在/sys/bus/usb/drivers/下的 sysfs 中。

Pointer to the name of the driver. It must be unique among all USB drivers in the kernel and is normally set to the same name as the module name of the driver. It shows up in sysfs under /sys/bus/usb/drivers/ when the driver is in the kernel.

const struct usb_device_id *id_table
const struct usb_device_id *id_table

指向 struct usb_device_id 表的指针,该表包含此驱动程序可接受的所有不同类型 USB 设备的列表。如果未设置此变量,则 USB 驱动程序中的 probe 回调函数永远不会被调用。如果您希望系统中的每个 USB 设备都始终调用您的驱动程序,请创建一个仅设置 driver_info 字段的条目:

Pointer to the struct usb_device_id table that contains a list of all of the different kinds of USB devices this driver can accept. If this variable is not set, the probe function callback in the USB driver is never called. If you want your driver always to be called for every USB device in the system, create an entry that sets only the driver_info field:

static struct usb_device_id usb_ids[  ] = {
    {.driver_info = 42},
    {  }
};
int (*probe) (struct usb_interface *intf, const struct usb_device_id *id)
int (*probe) (struct usb_interface *intf, const struct usb_device_id *id)

指向 USB 驱动程序中 probe 函数的指针。当 USB 核心认为它有一个该驱动程序可以处理的 struct usb_interface 时,就会调用此函数(在第 13.4.3 节中描述)。USB 核心用于做出此决定的 struct usb_device_id 的指针也会传递给此函数。如果 USB 驱动程序认领传递给它的 struct usb_interface,它应该正确初始化设备并返回 0。如果驱动程序不想认领该设备,或者发生错误,则应返回负的错误值。

Pointer to the probe function in the USB driver. This function (described in Section 13.4.3) is called by the USB core when it thinks it has a struct usb_interface that this driver can handle. A pointer to the struct usb_device_id that the USB core used to make this decision is also passed to this function. If the USB driver claims the struct usb_interface that is passed to it, it should initialize the device properly and return 0. If the driver does not want to claim the device, or an error occurs, it should return a negative error value.

void (*disconnect) (struct usb_interface *intf)
void (*disconnect) (struct usb_interface *intf)

指向 USB 驱动程序中 disconnect 函数的指针。当 struct usb_interface 已从系统中移除,或驱动程序正从 USB 核心卸载时,USB 核心会调用该函数(在第 13.4.3 节中描述)。

Pointer to the disconnect function in the USB driver. This function (described in Section 13.4.3) is called by the USB core when the struct usb_interface has been removed from the system or when the driver is being unloaded from the USB core.

因此,要创建一个有效的 struct usb_driver 结构体,只需要初始化五个字段:

So, to create a valid struct usb_driver structure, only five fields need to be initialized:

static struct usb_driver skel_driver = {
    .owner = THIS_MODULE,
    .name = "skeleton",
    .id_table = skel_table,
    .probe = skel_probe,
    .disconnect = skel_disconnect,
};

struct usb_driver 还包含一些其他回调,它们通常不常使用,并且对于 USB 驱动程序的正常工作不是必需的:

The struct usb_driver does contain a few more callbacks, which are generally not used very often, and are not required in order for a USB driver to work properly:

int (*ioctl) (struct usb_interface *intf, unsigned int code, void *buf)
int (*ioctl) (struct usb_interface *intf, unsigned int code, void *buf)

指向 USB 驱动程序中 ioctl 函数的指针。如果存在,当用户空间程序对与连接到此 USB 驱动程序的 USB 设备相关联的 usbfs 文件系统设备条目进行 ioctl 调用时,就会调用它。实际上,只有 USB 集线器驱动程序使用此 ioctl,因为其他 USB 驱动程序没有真正的需要使用它。

Pointer to an ioctl function in the USB driver. If it is present, it is called when a user-space program makes an ioctl call on the usbfs filesystem device entry associated with a USB device attached to this USB driver. In practice, only the USB hub driver uses this ioctl, as there is no other real need for any other USB driver to use it.

int (*suspend) (struct usb_interface *intf, u32 state)
int (*suspend) (struct usb_interface *intf, u32 state)

指向 USB 驱动程序中挂起函数的指针。当 USB 核心要挂起设备时调用它。

Pointer to a suspend function in the USB driver. It is called when the device is to be suspended by the USB core.

int (*resume) (struct usb_interface *intf)
int (*resume) (struct usb_interface *intf)

指向 USB 驱动程序中恢复函数的指针。当 USB 核心恢复设备时调用它。

Pointer to a resume function in the USB driver. It is called when the device is being resumed by the USB core.

要向 USB 核心注册 struct usb_driver,需要以指向该结构的指针为参数调用 usb_register。传统上,这是在 USB 驱动程序的模块初始化代码中完成的:

To register the struct usb_driver with the USB core, a call to usb_register is made with a pointer to the struct usb_driver. This is traditionally done in the module initialization code for the USB driver:

static int _ _init usb_skel_init(void)
{
    int result;

    /* register this driver with the USB subsystem */
    result = usb_register(&skel_driver);
    if (result)
        err("usb_register failed. Error number %d", result);

    return result;
}

当 USB 驱动程序要被卸载时,需要从内核中注销 struct usb_driver。这是通过调用 usb_deregister 来完成的。当此调用发生时,当前绑定到此驱动程序的任何 USB 接口都会被断开,并为它们调用 disconnect 函数。

When the USB driver is to be unloaded, the struct usb_driver needs to be unregistered from the kernel. This is done with a call to usb_deregister. When this call happens, any USB interfaces that were currently bound to this driver are disconnected, and the disconnect function is called for them.

static void _ _exit usb_skel_exit(void)
{
    /* deregister this driver with the USB subsystem */
    usb_deregister(&skel_driver);
}

probe 和 disconnect 详解

probe and disconnect in Detail

在上一节描述的 struct usb_driver 结构中,驱动程序指定了 USB 核心在适当时间调用的两个函数。当 USB 核心认为该驱动程序应该处理的设备已安装时,将调用 probe 函数;probe 函数应该检查传递给它的有关设备的信息,并确定该驱动程序是否真正适合该设备。当驱动程序由于某种原因不再控制设备时,将调用 disconnect 函数,它可以进行清理。

In the struct usb_driver structure described in the previous section, the driver specified two functions that the USB core calls at appropriate times. The probe function is called when a device is installed that the USB core thinks this driver should handle; the probe function should perform checks on the information passed to it about the device and decide whether the driver is really appropriate for that device. The disconnect function is called when the driver should no longer control the device for some reason and can do clean-up.

probe 和 disconnect 函数回调都是在 USB 集线器内核线程的上下文中调用的,因此在其中休眠是合法的。但是,建议尽可能把大部分工作放到用户打开设备时完成,以将 USB 探测时间保持在最短。这是因为 USB 核心在单个线程中处理 USB 设备的添加和移除,因此任何缓慢的设备驱动程序都可能导致 USB 设备检测时间变慢,并被用户察觉。

Both the probe and disconnect function callbacks are called in the context of the USB hub kernel thread, so it is legal to sleep within them. However, it is recommended that the majority of work be done when the device is opened by a user if possible, in order to keep the USB probing time to a minimum. This is because the USB core handles the addition and removal of USB devices within a single thread, so any slow device driver can cause the USB device detection time to slow down and become noticeable by the user.

探测函数回调中,USB 驱动程序应初始化可能用于管理 USB 设备的任何本地结构。它还应该将所需的有关设备的任何信息保存到本地结构中,因为此时通常更容易这样做。例如,USB 驱动程序通常想要检测设备的端点地址和缓冲区大小,因为它们是与设备通信所必需的。下面是一些示例代码,用于检测 BULK 类型的 IN 和 OUT 端点并将有关它们的一些信息保存在本地设备结构中:

In the probe function callback, the USB driver should initialize any local structures that it might use to manage the USB device. It should also save any information that it needs about the device to the local structure, as it is usually easier to do so at this time. As an example, USB drivers usually want to detect what the endpoint address and buffer sizes are for the device, as they are needed in order to communicate with the device. Here is some example code that detects both IN and OUT endpoints of BULK type and saves some information about them in a local device structure:

/* set up the endpoint information */
/* use only the first bulk-in and bulk-out endpoints */
iface_desc = interface->cur_altsetting;
for (i = 0; i < iface_desc->desc.bNumEndpoints; ++i) {
    endpoint = &iface_desc->endpoint[i].desc;

    if (!dev->bulk_in_endpointAddr &&
        (endpoint->bEndpointAddress & USB_DIR_IN) &&
        ((endpoint->bmAttributes & USB_ENDPOINT_XFERTYPE_MASK)
                =  = USB_ENDPOINT_XFER_BULK)) {
        /* we found a bulk in endpoint */
        buffer_size = endpoint->wMaxPacketSize;
        dev->bulk_in_size = buffer_size;
        dev->bulk_in_endpointAddr = endpoint->bEndpointAddress;
        dev->bulk_in_buffer = kmalloc(buffer_size, GFP_KERNEL);
        if (!dev->bulk_in_buffer) {
            err("Could not allocate bulk_in_buffer");
            goto error;
        }
    }

    if (!dev->bulk_out_endpointAddr &&
        !(endpoint->bEndpointAddress & USB_DIR_IN) &&
        ((endpoint->bmAttributes & USB_ENDPOINT_XFERTYPE_MASK)
                =  = USB_ENDPOINT_XFER_BULK)) {
        /* we found a bulk out endpoint */
        dev->bulk_out_endpointAddr = endpoint->bEndpointAddress;
    }
}
if (!(dev->bulk_in_endpointAddr && dev->bulk_out_endpointAddr)) {
    err("Could not find both bulk-in and bulk-out endpoints");
    goto error;
}

该代码块首先循环遍历该接口中存在的每个端点,并将一个指向端点结构的本地指针赋值,以便以后更容易访问:

This block of code first loops over every endpoint that is present in this interface and assigns a local pointer to the endpoint structure to make it easier to access later:

for (i = 0; i < iface_desc->desc.bNumEndpoints; ++i) {
    endpoint = &iface_desc->endpoint[i].desc;

然后,在我们拿到一个端点,并且还没有找到批量 IN 类型的端点时,我们查看该端点的方向是否为 IN。这可以通过查看端点变量 bEndpointAddress 中是否包含位掩码 USB_DIR_IN 来测试。如果是这样,我们先用 USB_ENDPOINT_XFERTYPE_MASK 位掩码屏蔽 bmAttributes 变量,然后检查它是否与值 USB_ENDPOINT_XFER_BULK 匹配,以此确定端点类型是否为批量:

Then, after we have an endpoint, and we have not found a bulk IN type endpoint already, we look to see if this endpoint's direction is IN. That can be tested by seeing whether the bitmask USB_DIR_IN is contained in the bEndpointAddress endpoint variable. If this is true, we determine whether the endpoint type is bulk or not, by first masking off the bmAttributes variable with the USB_ENDPOINT_XFERTYPE_MASK bitmask, and then checking if it matches the value USB_ENDPOINT_XFER_BULK:

if (!dev->bulk_in_endpointAddr &&
    (endpoint->bEndpointAddress & USB_DIR_IN) &&
    ((endpoint->bmAttributes & USB_ENDPOINT_XFERTYPE_MASK)
            == USB_ENDPOINT_XFER_BULK)) {

If all of these tests are true, the driver knows it found the proper type of endpoint and can save the information about the endpoint that it will later need to communicate over it in a local structure:

/* we found a bulk in endpoint */
buffer_size = endpoint->wMaxPacketSize;
dev->bulk_in_size = buffer_size;
dev->bulk_in_endpointAddr = endpoint->bEndpointAddress;
dev->bulk_in_buffer = kmalloc(buffer_size, GFP_KERNEL);
if (!dev->bulk_in_buffer) {
    err("Could not allocate bulk_in_buffer");
    goto error;
}

Because the USB driver needs to retrieve the local data structure that is associated with this struct usb_interface later in the lifecycle of the device, the function usb_set_intfdata can be called:

/* save our data pointer in this interface device */
usb_set_intfdata(interface, dev);

This function accepts a pointer to any data type and saves it in the struct usb_interface structure for later access. To retrieve the data, the function usb_get_intfdata should be called:

struct usb_skel *dev;
struct usb_interface *interface;
int subminor;
int retval = 0;

subminor = iminor(inode);

interface = usb_find_interface(&skel_driver, subminor);
if (!interface) {
    err("%s - error, can't find device for minor %d",
         __FUNCTION__, subminor);
    retval = -ENODEV;
    goto exit;
}

dev = usb_get_intfdata(interface);
if (!dev) {
    retval = -ENODEV;
    goto exit;
}

usb_get_intfdata is usually called in the open function of the USB driver and again in the disconnect function. Thanks to these two functions, USB drivers do not need to keep a static array of pointers that store the individual device structures for all current devices in the system. The indirect reference to device information allows an unlimited number of devices to be supported by any USB driver.

If the USB driver is not associated with another type of subsystem that handles the user interaction with the device (such as input, tty, video, etc.), the driver can use the USB major number in order to use the traditional char driver interface with user space. To do this, the USB driver must call the usb_register_dev function in the probe function when it wants to register a device with the USB core. Make sure that the device and driver are in a proper state to handle a user wanting to access the device as soon as this function is called.

/* we can register the device now, as it is ready */
retval = usb_register_dev(interface, &skel_class);
if (retval) {
    /* something prevented us from registering this driver */
    err("Not able to get a minor for this device.");
    usb_set_intfdata(interface, NULL);
    goto error;
}

The usb_register_dev function requires a pointer to a struct usb_interface and a pointer to a struct usb_class_driver. This struct usb_class_driver is used to define a number of different parameters that the USB driver wants the USB core to know when registering for a minor number. This structure consists of the following variables:

char *name

The name that sysfs uses to describe the device. A leading pathname, if present, is used only in devfs and is not covered in this book. If the number of the device needs to be in the name, the characters %d should be in the name string. For example, to create the devfs name usb/foo1 and the sysfs class name foo1, the name string should be set to usb/foo%d.

struct file_operations *fops;

Pointer to the struct file_operations that this driver has defined to use to register as the character device. See Chapter 3 for more information about this structure.

mode_t mode;

The mode for the devfs file to be created for this driver; unused otherwise. A typical setting for this variable would be the value S_IRUSR combined with the value S_IWUSR, which would provide only read and write access by the owner of the device file.

int minor_base;

This is the start of the assigned minor range for this driver. All devices associated with this driver are created with unique, increasing minor numbers beginning with this value. Only 16 devices are allowed to be associated with this driver at any one time unless the CONFIG_USB_DYNAMIC_MINORS configuration option has been enabled for the kernel. If so, this variable is ignored, and all minor numbers for the device are allocated on a first-come, first-served basis. It is recommended that systems that have enabled this option use a program such as udev to manage the device nodes in the system, as a static /dev tree will not work properly.

When the USB device is disconnected, all resources associated with the device should be cleaned up, if possible. At this time, if usb_register_dev has been called to allocate a minor number for this USB device during the probe function, the function usb_deregister_dev must be called to give the minor number back to the USB core.

In the disconnect function, it is also important to retrieve from the interface any data that was previously set with a call to usb_set_intfdata. Then set the data pointer in the struct usb_interface structure to NULL to prevent any further mistakes in accessing the data improperly:

static void skel_disconnect(struct usb_interface *interface)
{
    struct usb_skel *dev;
    int minor = interface->minor;

    /* prevent skel_open() from racing skel_disconnect() */
    lock_kernel();

    dev = usb_get_intfdata(interface);
    usb_set_intfdata(interface, NULL);

    /* give back our minor */
    usb_deregister_dev(interface, &skel_class);

    unlock_kernel();

    /* decrement our usage count */
    kref_put(&dev->kref, skel_delete);

    info("USB Skeleton #%d now disconnected", minor);
}

Note the call to lock_kernel in the previous code snippet. This takes the big kernel lock, so that the disconnect callback does not encounter a race condition with the open call when trying to get a pointer to the correct interface data structure. Because the open is called with the big kernel lock taken, if the disconnect also takes that same lock, only one portion of the driver can access and then set the interface data pointer.

Just before the disconnect function is called for a USB device, all urbs that are currently in transmission for the device are canceled by the USB core, so the driver does not have to explicitly call usb_kill_urb for these urbs. If a driver tries to submit a urb to a USB device after it has been disconnected with a call to usb_submit_urb, the submission will fail with an error value of -EPIPE.

Submitting and Controlling a Urb

When the driver has data to send to the USB device (as typically happens in a driver's write function), a urb must be allocated for transmitting the data to the device:

urb = usb_alloc_urb(0, GFP_KERNEL);
if (!urb) {
    retval = -ENOMEM;
    goto error;
}

After the urb is allocated successfully, a DMA buffer should also be created to send the data to the device in the most efficient manner, and the data that is passed to the driver should be copied into that buffer:

buf = usb_buffer_alloc(dev->udev, count, GFP_KERNEL, &urb->transfer_dma);
if (!buf) {
    retval = -ENOMEM;
    goto error;
}
if (copy_from_user(buf, user_buffer, count)) {
    retval = -EFAULT;
    goto error;
}

Once the data is properly copied from the user space into the local buffer, the urb must be initialized correctly before it can be submitted to the USB core:

/* initialize the urb properly */
usb_fill_bulk_urb(urb, dev->udev,
          usb_sndbulkpipe(dev->udev, dev->bulk_out_endpointAddr),
          buf, count, skel_write_bulk_callback, dev);
urb->transfer_flags |= URB_NO_TRANSFER_DMA_MAP;

Now that the urb is properly allocated, the data is properly copied, and the urb is properly initialized, it can be submitted to the USB core to be transmitted to the device:

/* send the data out the bulk port */
retval = usb_submit_urb(urb, GFP_KERNEL);
if (retval) {
    err("%s - failed submitting write urb, error %d", __FUNCTION__, retval);
    goto error;
}

After the urb is successfully transmitted to the USB device (or something happens in transmission), the urb callback is called by the USB core. In our example, we initialized the urb to point to the function skel_write_bulk_callback, and that is the function that is called:

static void skel_write_bulk_callback(struct urb *urb, struct pt_regs *regs)
{
    /* sync/async unlink faults aren't errors */
    if (urb->status &&
        !(urb->status == -ENOENT ||
          urb->status == -ECONNRESET ||
          urb->status == -ESHUTDOWN)) {
        dbg("%s - nonzero write bulk status received: %d",
            __FUNCTION__, urb->status);
    }

    /* free up our allocated buffer */
    usb_buffer_free(urb->dev, urb->transfer_buffer_length, 
            urb->transfer_buffer, urb->transfer_dma);
}

The first thing the callback function does is check the status of the urb to determine if this urb completed successfully or not. The error values, -ENOENT, -ECONNRESET, and -ESHUTDOWN are not real transmission errors, just reports about conditions accompanying a successful transmission. (See the list of possible errors for urbs detailed in the section Section 13.3.1.) Then the callback frees up the allocated buffer that was assigned to this urb to transmit.

It's common for another urb to be submitted to the device while the urb callback function is running. This is useful when streaming data to a device. Remember that the urb callback is running in interrupt context, so it should not do any memory allocation, hold any semaphores, or do anything else that could cause the process to sleep. When submitting a urb from within a callback, use the GFP_ATOMIC flag to tell the USB core not to sleep if it needs to allocate new memory chunks during the submission process.

USB Transfers Without Urbs

Sometimes a USB driver does not want to go through all of the hassle of creating a struct urb, initializing it, and then waiting for the urb completion function to run, just to send or receive some simple USB data. Two functions are available to provide a simpler interface.

usb_bulk_msg

usb_bulk_msg creates a USB bulk urb and sends it to the specified device, then waits for it to complete before returning to the caller. It is defined as:

int usb_bulk_msg(struct usb_device *usb_dev, unsigned int pipe,
                 void *data, int len, int *actual_length,
                 int timeout);

The parameters of this function are:

struct usb_device *usb_dev

A pointer to the USB device to send the bulk message to.

unsigned int pipe

The specific endpoint of the USB device to which this bulk message is to be sent. This value is created with a call to either usb_sndbulkpipe or usb_rcvbulkpipe.

void *data

A pointer to the data to send to the device if this is an OUT endpoint. If this is an IN endpoint, this is a pointer to where the data should be placed after being read from the device.

int len

The length of the buffer that is pointed to by the data parameter.

int *actual_length

A pointer to where the function places the actual number of bytes that have either been transferred to the device or received from the device, depending on the direction of the endpoint.

int timeout

The amount of time, in jiffies, that should be waited before timing out. If this value is 0, the function waits forever for the message to complete.

If the function is successful, the return value is 0; otherwise, a negative error number is returned. This error number matches up with the error numbers previously described for urbs in Section 13.3.1. If successful, the actual_length parameter contains the number of bytes that were transferred or received from this message.

The following is an example of using this function call:

/* do a blocking bulk read to get data from the device */
retval = usb_bulk_msg(dev->udev,
              usb_rcvbulkpipe(dev->udev, dev->bulk_in_endpointAddr),
              dev->bulk_in_buffer,
              min(dev->bulk_in_size, count),
              &count, HZ*10);

/* if the read was successful, copy the data to user space */
if (!retval) {
    if (copy_to_user(buffer, dev->bulk_in_buffer, count))
        retval = -EFAULT;
    else
        retval = count;
}

This example shows a simple bulk read from an IN endpoint. If the read is successful, the data is then copied to user space. This is typically done in a read function for a USB driver.

The usb_bulk_msg function cannot be called from within interrupt context or with a spinlock held. Also, this function cannot be canceled by any other function, so be careful when using it; make sure that your driver's disconnect knows enough to wait for the call to complete before allowing itself to be unloaded from memory.

usb_control_msg

The usb_control_msg function works just like the usb_bulk_msg function, except it allows a driver to send and receive USB control messages:

int usb_control_msg(struct usb_device *dev, unsigned int pipe,
                    __u8 request, __u8 requesttype,
                    __u16 value, __u16 index,
                    void *data, __u16 size, int timeout);

The parameters of this function are almost the same as usb_bulk_msg, with a few important differences:

struct usb_device *dev

A pointer to the USB device to send the control message to.

unsigned int pipe

The specific endpoint of the USB device that this control message is to be sent to. This value is created with a call to either usb_sndctrlpipe or usb_rcvctrlpipe.

__u8 request

The USB request value for the control message.

__u8 requesttype

The USB request type value for the control message.

__u16 value

The USB message value for the control message.

__u16 index

The USB message index value for the control message.

void *data

A pointer to the data to send to the device if this is an OUT endpoint. If this is an IN endpoint, this is a pointer to where the data should be placed after being read from the device.

__u16 size

The size of the buffer that is pointed to by the data parameter.

int timeout

The amount of time, in jiffies, that should be waited before timing out. If this value is 0, the function will wait forever for the message to complete.

If the function is successful, it returns the number of bytes that were transferred to or from the device. If it is not successful, it returns a negative error number.

The parameters request, requesttype, value, and index all directly map to the USB specification for how a USB control message is defined. For more information on the valid values for these parameters and how they are used, see Chapter 9 of the USB specification.

Like the function usb_bulk_msg, the function usb_control_msg cannot be called from within interrupt context or with a spinlock held. Also, this function cannot be canceled by any other function, so be careful when using it; make sure that your driver disconnect function knows enough to wait for the call to complete before allowing itself to be unloaded from memory.

Other USB Data Functions

A number of helper functions in the USB core can be used to retrieve standard information from all USB devices. These functions cannot be called from within interrupt context or with a spinlock held.

The function usb_get_descriptor retrieves the specified USB descriptor from the specified device. The function is defined as:

int usb_get_descriptor(struct usb_device *dev, unsigned char type,
                       unsigned char index, void *buf, int size);

This function can be used by a USB driver to retrieve from the struct usb_device structure any of the device descriptors that are not already present in the existing struct usb_device and struct usb_interface structures, such as audio descriptors or other class specific information. The parameters of the function are:

struct usb_device *usb_dev

A pointer to the USB device that the descriptor should be retrieved from.

unsigned char type

The descriptor type. This type is described in the USB specification and can be one of the following types:

USB_DT_DEVICE
USB_DT_CONFIG
USB_DT_STRING
USB_DT_INTERFACE
USB_DT_ENDPOINT
USB_DT_DEVICE_QUALIFIER
USB_DT_OTHER_SPEED_CONFIG
USB_DT_INTERFACE_POWER
USB_DT_OTG
USB_DT_DEBUG
USB_DT_INTERFACE_ASSOCIATION
USB_DT_CS_DEVICE
USB_DT_CS_CONFIG
USB_DT_CS_STRING
USB_DT_CS_INTERFACE
USB_DT_CS_ENDPOINT
unsigned char index

The number of the descriptor that should be retrieved from the device.

void *buf

A pointer to the buffer to which you copy the descriptor.

int size

The size of the memory pointed to by the buf variable.

If this function is successful, it returns the number of bytes read from the device. Otherwise, it returns a negative error number returned by the underlying call to usb_control_msg that this function makes.

One of the more common uses for the usb_get_descriptor call is to retrieve a string from the USB device. Because this is quite common, there is a helper function for it called usb_get_string:

int usb_get_string(struct usb_device *dev, unsigned short langid,
                   unsigned char index, void *buf, int size);

If successful, this function returns the number of bytes received by the device for the string. Otherwise, it returns a negative error number returned by the underlying call to usb_control_msg that this function makes.

If this function is successful, it returns a string encoded in the UTF-16LE format (Unicode, 16 bits per character, in little-endian byte order) in the buffer pointed to by the buf parameter. As this format is usually not very useful, there is another function, called usb_string, that returns a string that is read from a USB device and is already converted into an ISO 8859-1 format string. This character set is an 8-bit subset of Unicode and is the most common format for strings in English and other Western European languages. As this is typically the format that the USB device's strings are in, it is recommended that the usb_string function be used instead of the usb_get_string function.

Quick Reference

This section summarizes the symbols introduced in the chapter:

#include <linux/usb.h>

Header file where everything related to USB resides. It must be included by all USB device drivers.

struct usb_driver;

Structure that describes a USB driver.

struct usb_device_id;

Structure that describes the types of USB devices this driver supports.

int usb_register(struct usb_driver *d);

void usb_deregister(struct usb_driver *d);

Functions used to register and unregister a USB driver from the USB core.

struct usb_device *interface_to_usbdev(struct usb_interface *intf);

Retrieves the controlling struct usb_device * out of a struct usb_interface *.

struct usb_device;

Structure that controls an entire USB device.

struct usb_interface;

Main USB device structure that all USB drivers use to communicate with the USB core.

void usb_set_intfdata(struct usb_interface *intf, void *data);

void *usb_get_intfdata(struct usb_interface *intf);

Functions to set and get access to the private data pointer section within the struct usb_interface.

struct usb_class_driver;

A structure that describes a USB driver that wants to use the USB major number to communicate with user-space programs.

int usb_register_dev(struct usb_interface *intf, struct usb_class_driver *class_driver);

void usb_deregister_dev(struct usb_interface *intf, struct usb_class_driver *class_driver);

Functions used to register and unregister a specific struct usb_interface * structure with a struct usb_class_driver * structure.

struct urb;

Structure that describes a USB data transmission.

struct urb *usb_alloc_urb(int iso_packets, int mem_flags);

void usb_free_urb(struct urb *urb);

Functions used to create and destroy a struct urb.

int usb_submit_urb(struct urb *urb, int mem_flags);

int usb_kill_urb(struct urb *urb);

int usb_unlink_urb(struct urb *urb);

Functions used to start and stop a USB data transmission.

void usb_fill_int_urb(struct urb *urb, struct usb_device *dev, unsigned int pipe, void *transfer_buffer, int buffer_length, usb_complete_t complete, void *context, int interval);

void usb_fill_bulk_urb(struct urb *urb, struct usb_device *dev, unsigned int pipe, void *transfer_buffer, int buffer_length, usb_complete_t complete, void *context);

void usb_fill_control_urb(struct urb *urb, struct usb_device *dev, unsigned int pipe, unsigned char *setup_packet, void *transfer_buffer, int buffer_length, usb_complete_t complete, void *context);

Functions used to initialize a struct urb before it is submitted to the USB core.

int usb_bulk_msg(struct usb_device *usb_dev, unsigned int pipe, void *data,

int len, int *actual_length, int timeout);

int usb_control_msg(struct usb_device *dev, unsigned int pipe, __u8 request,

__u8 requesttype, __u16 value, __u16 index, void *data, __u16 size,

int timeout);

用于发送或接收 USB 数据、而无需使用 struct urb 的函数。

Functions used to send or receive USB data without having to use a struct urb.




[1] 本章的部分内容基于 Linux 内核 USB 代码的内核内文档,这些文档由内核 USB 开发人员编写并在 GPL 下发布。

[1] Portions of this chapter are based on the in-kernel documentation for the Linux kernel USB code, which were written by the kernel USB developers and released under the GPL.

[2] 实际上,确实存在一些结构,但它主要归结为要求通信符合几个预定义类别之一:例如,键盘不会分配带宽,而某些摄像头则会。

[2] Actually, some structure is there, but it mostly reduces to a requirement for the communication to fit into one of a few predefined classes: a keyboard won't allocate bandwidth, for example, while some video cameras will.

第 14 章 Linux 设备模型

Chapter 14. The Linux Device Model

2.5 开发周期的既定目标之一是为内核创建统一的设备模型。 以前的内核没有单一的数据结构可以用来获取有关系统如何组合在一起的信息。尽管缺乏信息,但在一段时间内一切进展顺利。然而,新系统的需求及其更复杂的拓扑结构以及支持电源管理等功能的需求清楚地表明,需要一个描述系统结构的通用抽象。

One of the stated goals for the 2.5 development cycle was the creation of a unified device model for the kernel. Previous kernels had no single data structure to which they could turn to obtain information about how the system is put together. Despite this lack of information, things worked well for some time. The demands of newer systems, with their more complicated topologies and need to support features such as power management, made it clear, however, that a general abstraction describing the structure of the system was needed.

2.6 设备模型提供了这种抽象。它现在在内核中使用来支持各种任务,包括:

The 2.6 device model provides that abstraction. It is now used within the kernel to support a wide variety of tasks, including:

电源管理和系统关闭
Power management and system shutdown

这些功能需要了解系统的结构。例如,在处理完连接到某个 USB 主机适配器的所有设备之前,不能关闭该适配器。设备模型使得能够以正确的顺序遍历系统的硬件。

These require an understanding of the system's structure. For example, a USB host adaptor cannot be shut down before dealing with all of the devices connected to that adaptor. The device model enables a traversal of the system's hardware in the right order.

与用户空间的通信
Communications with user space

实施 sysfs 虚拟文件系统与设备模型紧密相关,并公开其所表示的结构。向用户空间和用于更改操作参数的旋钮提供有关系统的信息越来越多地通过 sysfs 完成,因此也通过设备模型完成。

The implementation of the sysfs virtual filesystem is tightly tied into the device model and exposes the structure represented by it. The provision of information about the system to user space and knobs for changing operating parameters is increasingly done through sysfs and, therefore, through the device model.

热插拔设备
Hotpluggable devices

计算机硬件越来越动态;外围设备可以随用户的喜好来来去去。内核中用于处理设备插拔、并且(尤其是)就此与用户空间通信的热插拔机制是通过设备模型进行管理的。

Computer hardware is increasingly dynamic; peripherals can come and go at the whim of the user. The hotplug mechanism used within the kernel to handle and (especially) communicate with user space about the plugging and unplugging of devices is managed through the device model.

设备类别
Device classes

系统的许多部分对设备如何连接并不关心,但需要知道有哪些类型的设备可用。设备模型包括一种将设备分配给类的机制;类在更高的功能层面上描述这些设备,并允许从用户空间发现它们。

Many parts of the system have little interest in how devices are connected, but they need to know what kinds of devices are available. The device model includes a mechanism for assigning devices to classes, which describe those devices at a higher, functional level and allow them to be discovered from user space.

对象生命周期
Object lifecycles

上述许多功能(包括热插拔支持和 sysfs)使内核中对象的创建和操作变得复杂。设备模型的实现需要创建一组机制来处理对象生命周期、它们之间的关系以及它们在用户空间中的表示。

Many of the functions described above, including hotplug support and sysfs, complicate the creation and manipulation of objects created within the kernel. The implementation of the device model required the creation of a set of mechanisms for dealing with object lifecycles, their relationships to each other, and their representation in user space.

Linux 设备模型是一个复杂的数据结构。例如,考虑图 14-1,它(以简化形式)显示了与 USB 鼠标相关的设备模型结构的一小部分。在图的中央,我们看到核心“设备”树的一部分,它显示了鼠标如何连接到系统。“总线”树跟踪连接到每条总线上的内容,而“类”下的子树则关注设备提供的功能,无论它们如何连接。即使是一个简单系统上的设备模型树也包含数百个类似图中所示的节点;这是一个很难从整体上可视化的数据结构。

The Linux device model is a complex data structure. For example, consider Figure 14-1, which shows (in simplified form) a tiny piece of the device model structure associated with a USB mouse. Down the center of the diagram, we see the part of the core "devices" tree that shows how the mouse is connected to the system. The "bus" tree tracks what is connected to each bus, while the subtree under "classes" concerns itself with the functions provided by the devices, regardless of how they are connected. The device model tree on even a simple system contains hundreds of nodes like those shown in the diagram; it is a difficult data structure to visualize as a whole.

设备模型的一小部分

图 14-1。设备模型的一小部分

Figure 14-1. A small piece of the device model

大多数情况下,Linux 设备模型代码会考虑所有这些注意事项,而不会将其强加给驱动程序作者。它大部分位于背景中;与设备模型的直接交互通常由总线级逻辑和各种其他内核子系统处理。因此,许多驱动程序作者可以完全忽略设备模型,并相信它会自行处理。

For the most part, the Linux device model code takes care of all these considerations without imposing itself upon driver authors. It sits mostly in the background; direct interaction with the device model is generally handled by bus-level logic and various other kernel subsystems. As a result, many driver authors can ignore the device model entirely, and trust it to take care of itself.

然而,有时了解设备模型是件好事。设备模型有时会从其他层后面“泄漏”出来;例如,通用 DMA 代码(我们将在第 15 章中遇到)使用 struct device。您可能想要使用设备模型提供的一些功能,例如 kobject 提供的引用计数和相关特性。通过 sysfs 与用户空间的通信也是设备模型的功能之一;本章将解释这种通信是如何进行的。

There are times, however, when an understanding of the device model is a good thing to have. There are times when the device model "leaks out" from behind the other layers; for example, the generic DMA code (which we encounter in Chapter 15) works with struct device. You may want to use some of the capabilities provided by the device model, such as the reference counting and related features provided by kobjects. Communication with user space via sysfs is also a device model function; this chapter explains how that communication works.

然而,我们从自下而上地介绍设备模型开始。设备模型的复杂性使得从高层视图开始很难理解。我们希望,通过展示低级设备组件的工作原理,我们可以帮助您做好准备,应对掌握如何使用这些组件构建更大结构的挑战。

We start, however, with a bottom-up presentation of the device model. The complexity of the device model makes it hard to understand by starting with a high-level view. Our hope is that, by showing how the low-level device components work, we can prepare you for the challenge of grasping how those components are used to build the larger structure.

对于许多读者来说,本章可以视为高级材料,第一次通读时不必阅读。然而,我们鼓励那些对 Linux 设备模型如何工作感兴趣的读者继续读下去,随我们深入底层细节。

For many readers, this chapter can be treated as advanced material that need not be read the first time through. Those who are interested in how the Linux device model works are encouraged to press ahead, however, as we get into the low-level details.

Kobject、Kset 和子系统

Kobjects, Ksets, and Subsystems

kobject 是将设备模型结合在一起的基本结构。它最初被设想为一个简单的引用计数器,但随着时间的推移,它的职责不断增长,它的字段也随之增多。struct kobject 及其支持代码现在处理的任务包括:

The kobject is the fundamental structure that holds the device model together. It was initially conceived as a simple reference counter, but its responsibilities have grown over time, and so have its fields. The tasks handled by struct kobject and its supporting code now include:

对象的引用计数
Reference counting of objects

通常,当创建内核对象时,无法知道它会存在多久。跟踪此类对象生命周期的一种方法是通过引用计数。当内核中没有代码保存对给定对象的引用时,该对象已完成其使用寿命并且可以被删除。

Often, when a kernel object is created, there is no way to know just how long it will exist. One way of tracking the lifecycle of such objects is through reference counting. When no code in the kernel holds a reference to a given object, that object has finished its useful life and can be deleted.

系统文件系统表示
Sysfs representation

sysfs 中显示的每个对象在其下面都有一个 kobject,它与内核交互以创建其可见表示。

Every object that shows up in sysfs has, underneath it, a kobject that interacts with the kernel to create its visible representation.

数据结构胶水
Data structure glue

整个设备模型是一个极其复杂的数据结构,由多个层次结构组成,它们之间有许多链接。kobject 实现了这个结构并将其组合在一起。

The device model is, in its entirety, a fiendishly complicated data structure made up of multiple hierarchies with numerous links between them. The kobject implements this structure and holds it together.

热插拔事件处理
Hotplug event handling

kobject 子系统处理事件的生成,通知用户空间有关系统上硬件的情况。

The kobject subsystem handles the generation of events that notify user space about the comings and goings of hardware on the system.

从前面的列表中,人们可能会得出 kobject 是一种复杂结构的结论。确实如此。然而,通过一次只看一部分,是可以理解这个结构及其工作原理的。

One might conclude from the preceding list that the kobject is a complicated structure. One would be right. By looking at one piece at a time, however, it is possible to understand this structure and how it works.

Kobject 基础知识

Kobject Basics

kobject 的类型是 struct kobject,它定义在 <linux/kobject.h> 中。该文件还包括与 kobject 相关的许多其他结构的声明,当然还有一长串用于操作它们的函数。

A kobject has the type struct kobject; it is defined in <linux/kobject.h>. That file also includes declarations for a number of other structures related to kobjects and, of course, a long list of functions for manipulating them.

嵌入kobjects

Embedding kobjects

在进入细节之前,值得花点时间了解 kobject 的使用方式。如果您回顾 kobject 所处理的功能列表,您会发现它们都是代表其他对象执行的服务。换句话说,kobject 本身没有什么意义;它的存在只是为了将更高级别的对象绑定到设备模型中。

Before we get into the details, it is worth taking a moment to understand how kobjects are used. If you look back at the list of functions handled by kobjects, you see that they are all services performed on behalf of other objects. A kobject, in other words, is of little interest on its own; it exists only to tie a higher-level object into the device model.

因此,内核代码很少(甚至未知)创建独立的 kobject;相反,kobject 用于控制对更大的、特定于域的对象的访问。为此,kobject 被嵌入到其他结构中。如果您习惯于用面向对象的术语来思考事物,那么 kobject 可以被视为一个顶级的抽象类,其他类都派生自该类。kobject 实现了一组功能,这些功能本身并不是特别有用,但在其他对象中却很有用。C 语言不允许直接表达继承,因此必须使用其他技术(例如将一种结构嵌入另一种结构)。

Thus, it is rare (even unknown) for kernel code to create a standalone kobject; instead, kobjects are used to control access to a larger, domain-specific object. To this end, kobjects are found embedded in other structures. If you are used to thinking of things in object-oriented terms, kobjects can be seen as a top-level, abstract class from which other classes are derived. A kobject implements a set of capabilities that are not particularly useful by themselves but that are nice to have in other objects. The C language does not allow for the direct expression of inheritance, so other techniques—such as embedding one structure in another—must be used.

作为一个例子,让我们回顾一下我们在第 3 章中遇到的 struct cdev。2.6.10 内核中的该结构如下所示:

As an example, let's look back at struct cdev, which we encountered in Chapter 3. That structure, as found in the 2.6.10 kernel, looks like this:

struct cdev {
    struct kobject kobj;
    struct module *owner;
    struct file_operations *ops;
    struct list_head list;
    dev_t dev;
    unsigned int count;
};

正如我们所看到的,cdev 结构中嵌入了一个 kobject。如果您有这样一个结构,找到其嵌入的 kobject 只需使用 kobj 字段即可。然而,使用 kobject 的代码通常会遇到相反的问题:给定一个 struct kobject 指针,指向包含结构的指针是什么?您应该避免使用技巧(例如假设 kobject 位于结构的开头),而应使用 container_of 宏(在第 3.5.1 节中介绍)。因此,将一个名为 kp、嵌入在 struct cdev 中的 struct kobject 指针转换出来的方法是:

As we can see, the cdev structure has a kobject embedded within it. If you have one of these structures, finding its embedded kobject is just a matter of using the kobj field. Code that works with kobjects often has the opposite problem, however: given a struct kobject pointer, what is the pointer to the containing structure? You should avoid tricks (such as assuming that the kobject is at the beginning of the structure), and, instead, use the container_of macro (introduced in Section 3.5.1. So the way to convert a pointer to a struct kobject called kp embedded within a struct cdev would be:

struct cdev *device = container_of(kp, struct cdev, kobj);

程序员经常定义一个简单的宏,用于将 kobject 指针“反向转换”(back-cast)为包含它的类型。

Programmers often define a simple macro for "back-casting" kobject pointers to the containing type.
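The back-casting described above can be demonstrated in plain user-space C. The sketch below mirrors the kernel's container_of definition; the names cdev_like, kobject_like, and to_cdev_like are illustrative stand-ins, not kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* A user-space sketch of container_of: subtract the member's offset
 * from the member's address to recover the containing structure. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct kobject_like { int refcount; };

struct cdev_like {
    int dev;                      /* some payload before the kobject */
    struct kobject_like kobj;     /* embedded, not at offset 0 */
};

/* The kind of "back-casting" macro programmers often define: */
#define to_cdev_like(kp) container_of(kp, struct cdev_like, kobj)

struct cdev_like *cdev_from_kobj(struct kobject_like *kp)
{
    return to_cdev_like(kp);      /* recovers the cdev_like pointer */
}
```

Because the macro works purely from the member's offset, it is correct no matter where the kobject sits inside the containing structure, which is exactly why the "kobject must be first" trick is unnecessary.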

Kobject 初始化

Kobject initialization

本书提出了一些具有在编译或运行时初始化的简单机制的类型。kobject 的初始化有点复杂,特别是当它的所有函数都被使用时。然而,无论如何使用 kobject,都必须执行几个步骤。

This book has presented a number of types with simple mechanisms for initialization at compile or runtime. The initialization of a kobject is a bit more complicated, especially when all of its functions are used. Regardless of how a kobject is used, however, a few steps must be performed.

第一个是简单地将整个 kobject 设置为0,通常通过调用memset。通常,这种初始化是作为 kobject 嵌入到的结构清零的一部分而发生的。未能将 kobject 归零通常会导致后续非常奇怪的崩溃;这不是您想要跳过的步骤。

The first of those is to simply set the entire kobject to 0, usually with a call to memset. Often this initialization happens as part of the zeroing of the structure into which the kobject is embedded. Failure to zero out a kobject often leads to very strange crashes further down the line; it is not a step you want to skip.

下一步是通过调用 kobject_init()设置一些内部字段:

The next step is to set up some of the internal fields with a call to kobject_init( ):

void kobject_init(struct kobject *kobj);

其中,kobject_init 将 kobject 的引用计数设置为 1。然而,仅调用 kobject_init 还不够。kobject 的用户至少必须设置 kobject 的名称;这是在 sysfs 条目中使用的名称。如果您深入研究内核源代码,可以找到将字符串直接复制到 kobject 的 name 字段的代码,但应该避免这种方法。相反,请使用:

Among other things, kobject_init sets the kobject's reference count to one. Calling kobject_init is not sufficient, however. Kobject users must, at a minimum, set the name of the kobject; this is the name that is used in sysfs entries. If you dig through the kernel source, you can find the code that copies a string directly into the kobject's name field, but that approach should be avoided. Instead, use:

int kobject_set_name(struct kobject *kobj, const char *format, ...);

该函数采用printk样式的变量参数列表。不管你相信与否,这个操作实际上有可能失败(它可能会尝试分配内存);认真的代码应该检查返回值并做出相应的反应。

This function takes a printk-style variable argument list. Believe it or not, it is actually possible for this operation to fail (it may try to allocate memory); conscientious code should check the return value and react accordingly.

应该由创建者直接或间接设置的其他 kobject 字段是 ktype、kset 和 parent。我们将在本章稍后讨论这些内容。

The other kobject fields that should be set, directly or indirectly, by the creator are ktype, kset, and parent. We will get to these later in this chapter.

引用计数操纵

Reference count manipulation

kobject 的关键功能之一是充当其所嵌入对象的引用计数器。只要对该对象的引用存在,该对象(以及支持它的代码)就必须继续存在。用于操作 kobject 引用计数的低级函数是:

One of the key functions of a kobject is to serve as a reference counter for the object in which it is embedded. As long as references to the object exist, the object (and the code that supports it) must continue to exist. The low-level functions for manipulating a kobject's reference counts are:

struct kobject *kobject_get(struct kobject *kobj);
void kobject_put(struct kobject *kobj);

成功调用kobject_get会增加 kobject 的引用计数器并返回指向 kobject 的指针。然而,如果 kobject 已经在被销毁的过程中,则操作失败,并且 kobject_get返回NULL。必须始终测试此返回值,否则可能会导致不愉快的竞争条件。

A successful call to kobject_get increments the kobject's reference counter and returns a pointer to the kobject. If, however, the kobject is already in the process of being destroyed, the operation fails, and kobject_get returns NULL. This return value must always be tested, or no end of unpleasant race conditions could result.

当释放引用时,对kobject_put 的调用 会减少引用计数,并可能释放该对象。请记住, kobject_init将引用计数设置为 1;因此,当您创建一个 kobject 时,您应该确保 当不再需要该初始引用时进行相应的kobject_put调用。

When a reference is released, the call to kobject_put decrements the reference count and, possibly, frees the object. Remember that kobject_init sets the reference count to one; so when you create a kobject, you should make sure that the corresponding kobject_put call is made when that initial reference is no longer needed.
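The get/put discipline just described can be modeled in user space. In this sketch, my_kobj, my_get, and my_put are illustrative stand-ins for the kernel API (which additionally handles locking); the point is the lifecycle: the count starts at one, a get on a dying object fails, and the last put triggers the release callback:

```c
#include <assert.h>
#include <stdlib.h>

/* A user-space model of the kobject reference-count lifecycle. */
struct my_kobj {
    int refcount;                     /* init sets this to 1 */
    int dying;                        /* set once teardown has begun */
    void (*release)(struct my_kobj *k);
};

int my_release_count;                 /* how many times release ran */

void counting_release(struct my_kobj *k)
{
    (void)k;                          /* a real release would free here */
    my_release_count++;
}

void my_init(struct my_kobj *k, void (*release)(struct my_kobj *))
{
    k->refcount = 1;                  /* the creator holds one reference */
    k->dying = 0;
    k->release = release;
}

struct my_kobj *my_get(struct my_kobj *k)
{
    if (k->dying)                     /* like kobject_get returning NULL */
        return NULL;
    k->refcount++;
    return k;
}

void my_put(struct my_kobj *k)
{
    if (--k->refcount == 0) {         /* last reference just went away */
        k->dying = 1;
        k->release(k);                /* notification happens here */
    }
}
```

Note that the creator never knows in advance which my_put call will be the last one; that is why the freeing logic lives in the release callback rather than at a fixed point in the code.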

请注意,在许多情况下,kobject 本身的引用计数可能不足以防止竞争条件。例如,kobject(及其包含结构)的存在很可能要求创建该 kobject 的模块继续存在。在 kobject 仍在被传递使用时卸载该模块是不行的。这就是为什么我们上面看到的 cdev 结构包含一个 struct module 指针。struct cdev 的引用计数实现如下:

Note that, in many cases, the reference count in the kobject itself may not be sufficient to prevent race conditions. The existence of a kobject (and its containing structure) may well, for example, require the continued existence of the module that created that kobject. It would not do to unload that module while the kobject is still being passed around. That is why the cdev structure we saw above contains a struct module pointer. Reference counting for struct cdev is implemented as follows:

struct kobject *cdev_get(struct cdev *p)
{
    struct module *owner = p->owner;
    struct kobject *kobj;

    if (owner && !try_module_get(owner))
        return NULL;
    kobj = kobject_get(&p->kobj);
    if (!kobj)
        module_put(owner);
    return kobj;
}

创建对 cdev 结构的引用,还需要创建对拥有它的模块的引用。因此,cdev_get 使用 try_module_get 尝试增加该模块的使用计数。如果该操作成功,则再用 kobject_get 增加 kobject 的引用计数。当然,该操作可能会失败,因此代码会检查 kobject_get 的返回值,并在出现问题时释放其对模块的引用。

Creating a reference to a cdev structure requires creating a reference also to the module that owns it. So cdev_get uses try_module_get to attempt to increment that module's usage count. If that operation succeeds, kobject_get is used to increment the kobject's reference count as well. That operation could fail, of course, so the code checks the return value from kobject_get and releases its reference to the module if things don't work out.
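The acquire-then-roll-back pattern in cdev_get generalizes: take reference A, try reference B, and drop A if B fails. A user-space model of that control flow (the *_like names and the global counters are illustrative, not the real module/kobject state):

```c
#include <assert.h>

int mod_refs;        /* stands in for the module's usage count */
int obj_refs;        /* stands in for the kobject's reference count */
int obj_dying;       /* nonzero once the kobject is being destroyed */

int try_module_get_like(void) { mod_refs++; return 1; }
void module_put_like(void)    { mod_refs--; }

int kobject_get_like(void)
{
    if (obj_dying)
        return 0;            /* kobject_get would return NULL here */
    obj_refs++;
    return 1;
}

int cdev_get_like(void)
{
    if (!try_module_get_like())
        return 0;            /* nothing acquired, nothing to undo */
    if (!kobject_get_like()) {
        module_put_like();   /* roll back the module reference */
        return 0;
    }
    return 1;                /* both references now held */
}
```

The essential invariant is that the function either holds both references or neither; a caller never has to guess which partial state it was left in.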

释放函数和 kobject 类型

Release functions and kobject types

讨论中仍然缺少的一件重要事情是:当 kobject 的引用计数达到 0 时会发生什么。创建 kobject 的代码通常不知道这何时会发生;如果它知道,那么一开始就没有必要使用引用计数了。引入 sysfs 后,即使是可预测的对象生命周期也会变得更加复杂;用户空间程序可以(通过保持其关联的某个 sysfs 文件打开)在任意长的时间内持有对 kobject 的引用。

One important thing still missing from the discussion is what happens to a kobject when its reference count reaches 0. The code that created the kobject generally does not know when that will happen; if it did, there would be little point in using a reference count in the first place. Even predictable object life cycles become more complicated when sysfs is brought in; user-space programs can keep a reference to a kobject (by keeping one of its associated sysfs files open) for an arbitrary period of time.

最终结果是,受 kobject 保护的结构不能在驱动程序生命周期中任何单一的、可预测的时间点上被释放,而只能在必须准备好于 kobject 引用计数变为 0 的任何时刻运行的代码中释放。引用计数不受创建 kobject 的代码的直接控制。因此,每当对其某个 kobject 的最后一个引用消失时,必须异步地通知该代码。

The end result is that a structure protected by a kobject cannot be freed at any single, predictable point in the driver's lifecycle, but in code that must be prepared to run at whatever moment the kobject's reference count goes to 0. The reference count is not under the direct control of the code that created the kobject. So that code must be notified asynchronously whenever the last reference to one of its kobjects goes away.

该通知是通过 kobject 的 release 方法完成的。通常,该方法的形式如下:

This notification is done through a kobject's release method. Usually, this method has a form such as:

void my_object_release(struct kobject *kobj)
{
    struct my_object *mine = container_of(kobj, struct my_object, kobj);

    /* Perform any additional cleanup on this object, then... */
    kfree(mine);
}

有一点重要之处怎么强调都不为过:每个 kobject 都必须有一个 release 方法,并且 kobject 必须持续存在(处于一致的状态),直到该方法被调用。如果不满足这些约束,代码就是有缺陷的:它要么冒着在对象仍在使用时释放该对象的风险,要么在最后一个引用归还后未能释放该对象。

One important point cannot be overstated: every kobject must have a release method, and the kobject must persist (in a consistent state) until that method is called. If these constraints are not met, the code is flawed. It risks freeing the object when it is still in use, or it fails to release the object after the last reference is returned.

有趣的是,release 方法并不存储在 kobject 本身中;相反,它与包含该 kobject 的结构的类型相关联。该类型通过一个 struct kobj_type 类型的结构进行跟踪,通常简称为“ktype”。该结构如下所示:

Interestingly, the release method is not stored in the kobject itself; instead, it is associated with the type of the structure that contains the kobject. This type is tracked with a structure of type struct kobj_type, often simply called a "ktype." This structure looks like the following:

struct kobj_type {
    void (*release)(struct kobject *);
    struct sysfs_ops *sysfs_ops;
    struct attribute **default_attrs;
};

当然,struct kobj_type 中的 release 字段是指向这种类型 kobject 的 release 方法的指针。我们将在本章稍后讨论其他两个字段(sysfs_ops 和 default_attrs)。

The release field in struct kobj_type is, of course, a pointer to the release method for this type of kobject. We will come back to the other two fields (sysfs_ops and default_attrs) later in this chapter.

每个 kobject 都需要有一个关联的 kobj_type 结构。令人困惑的是,指向该结构的指针可以在两个不同的地方找到。kobject 结构本身包含一个可以存放此指针的字段(称为 ktype)。但是,如果此 kobject 是某个 kset 的成员,则 kobj_type 指针改由该 kset 提供。(我们将在下一节中讨论 kset。)同时,宏:

Every kobject needs to have an associated kobj_type structure. Confusingly, the pointer to this structure can be found in two different places. The kobject structure itself contains a field (called ktype) that can contain this pointer. If, however, this kobject is a member of a kset, the kobj_type pointer is provided by that kset instead. (We will look at ksets in the next section.) Meanwhile, the macro:

struct kobj_type *get_ktype(struct kobject *kobj);

可以找到给定 kobject 的 kobj_type 指针。

finds the kobj_type pointer for a given kobject.
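The precedence rule just described, where a kset's ktype wins over the kobject's own field, can be sketched in user space. The *_like types below are illustrative stand-ins, not the kernel's structures:

```c
#include <assert.h>
#include <stddef.h>

/* A sketch of get_ktype's lookup rule: if the kobject belongs to a
 * kset that supplies a ktype, that one is used; otherwise the
 * kobject's own ktype field is consulted. */
struct ktype_like { const char *tag; };
struct kset_like  { struct ktype_like *ktype; };

struct kobj_like {
    struct ktype_like *ktype;   /* often left NULL in practice */
    struct kset_like  *kset;
};

struct ktype_like *get_ktype_like(struct kobj_like *k)
{
    if (k->kset && k->kset->ktype)
        return k->kset->ktype;  /* the kset's type takes precedence */
    return k->ktype;
}
```

This is why, in typical usage, a kobject that lives in a kset leaves its own ktype field NULL: the lookup never reaches it.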

Kobject 层次结构、Kset 和子系统

Kobject Hierarchies, Ksets, and Subsystems

kobject 结构通常用于将对象链接在一起,形成与所建模子系统的结构相匹配的层次结构。这种链接有两种独立的机制:parent 指针和 kset。

The kobject structure is often used to link together objects into a hierarchical structure that matches the structure of the subsystem being modeled. There are two separate mechanisms for this linking: the parent pointer and ksets.

struct kobject 中的 parent 字段是指向另一个 kobject 的指针,后者代表层次结构中的上一层。例如,如果某个 kobject 代表一个 USB 设备,则其 parent 指针可以指向代表该设备所插入的集线器的对象。

The parent field in struct kobject is a pointer to another kobject—the one representing the next level up in the hierarchy. If, for example, a kobject represents a USB device, its parent pointer may indicate the object representing the hub into which the device is plugged.

parent 指针的主要用途是在 sysfs 层次结构中定位对象。我们将在第 14.2 节中了解它是如何工作的。

The main use for the parent pointer is to position the object in the sysfs hierarchy. We'll see how this works in Section 14.2.

Kset

Ksets

在许多方面,kset 看起来像是 kobj_type 结构的扩展;kset 是嵌入在相同类型结构中的 kobject 的集合。然而,struct kobj_type 关注的是对象的类型,而 struct kset 则关注聚合与集合。这两个概念之所以分开,是为了让相同类型的对象可以出现在不同的集合中。

In many ways, a kset looks like an extension of the kobj_type structure; a kset is a collection of kobjects embedded within structures of the same type. However, while struct kobj_type concerns itself with the type of an object, struct kset is concerned with aggregation and collection. The two concepts have been separated so that objects of identical type can appear in distinct sets.

因此,kset的主要功能是包含;它可以被认为是 kobject 的顶级容器类。事实上,每个 kset 内部都包含自己的 kobject,并且在很多方面都可以将其视为 kobject。值得注意的是,kset 总是在 sysfs 中表示;一旦一个 kset 被设置并添加到系统中,就会有一个 sysfs 目录。Kobject 不一定出现在 sysfs 中,但作为 kset 成员的每个 kobject 都在那里表示。

Therefore, the main function of a kset is containment; it can be thought of as the top-level container class for kobjects. In fact, each kset contains its own kobject internally, and it can, in many ways, be treated the same way as a kobject. It is worth noting that ksets are always represented in sysfs; once a kset has been set up and added to the system, there will be a sysfs directory for it. Kobjects do not necessarily show up in sysfs, but every kobject that is a member of a kset is represented there.

将 kobject 添加到 kset 通常是在创建对象时完成的;这是一个两步过程。kobject 的 kset 字段必须指向感兴趣的 kset;然后应将该 kobject 传递给:

Adding a kobject to a kset is usually done when the object is created; it is a two-step process. The kobject's kset field must be pointed at the kset of interest; then the kobject should be passed to:

int kobject_add(struct kobject *kobj);
int kobject_add(struct kobject *kobj);

与往常一样,程序员应该意识到该函数可能会失败(在这种情况下它会返回负错误代码)并做出相应的响应。内核提供了一个方便的函数:

As always, programmers should be aware that this function can fail (in which case it returns a negative error code) and respond accordingly. There is a convenience function provided by the kernel:

extern int kobject_register(struct kobject *kobj);
extern int kobject_register(struct kobject *kobj);

该函数只是kobject_initkobject_add的组合。

This function is simply a combination of kobject_init and kobject_add.

当一个 kobject 被传递给kobject_add时,它的引用计数就会增加。毕竟,kset 中的包含是对对象的引用。在某些时候,kobject 可能必须从 kset 中删除才能清除该引用;这是通过以下方式完成的:

When a kobject is passed to kobject_add, its reference count is incremented. Containment within the kset is, after all, a reference to the object. At some point, the kobject will probably have to be removed from the kset to clear that reference; that is done with:

无效kobject_del(结构kobject * kobj);
void kobject_del(struct kobject *kobj);

还有一个 kobject_unregister 函数,它是 kobject_del 和 kobject_put 的组合。

There is also a kobject_unregister function, which is a combination of kobject_del and kobject_put.

kset 将其子项保存在一个标准的内核链表中。在几乎所有情况下,所包含的 kobject 的 parent 字段中也有指向该 kset(或者严格地说,其嵌入的 kobject)的指针。因此,通常情况下,kset 及其 kobject 看起来类似于图 14-2 中所示。请记住:

A kset keeps its children in a standard kernel linked list. In almost all cases, the contained kobjects also have pointers to the kset (or, strictly, its embedded kobject) in their parent fields. So, typically, a kset and its kobjects look something like what you see in Figure 14-2. Bear in mind that:

  • 图中包含的所有 kobject 实际上都嵌入在其他类型中,甚至可能是其他 kset 中。

  • All of the contained kobjects in the diagram are actually embedded within some other type, possibly even other ksets.

  • 不要求 kobject 的父级是包含 kset(尽管任何其他组织都会很奇怪且罕见)。

  • It is not required that a kobject's parent be the containing kset (although any other organization would be strange and rare).

一个简单的 kset 层次结构

图 14-2。一个简单的 kset 层次结构

Figure 14-2. A simple kset hierarchy

对 kset 的操作

Operations on ksets

对于初始化和设置,kset 的接口与 kobject 的非常相似。存在以下函数:

For initialization and setup, ksets have an interface very similar to that of kobjects. The following functions exist:

void kset_init(struct kset *kset);
int kset_add(struct kset *kset);
int kset_register(struct kset *kset);
void kset_unregister(struct kset *kset);

大多数情况下,这些功能只需在 kset 的嵌入 kobject 上调用类似的kobject_函数即可。

For the most part, these functions just call the analogous kobject_ function on the kset's embedded kobject.

在管理 kset 的引用计数方面,情况大致相同:

To manage the reference counts of ksets, the situation is about the same:

struct kset *kset_get(struct kset *kset);
void kset_put(struct kset *kset);

kset 还有一个名称,存储在嵌入的 kobject 中。因此,如果您有一个名为 my_set 的 kset,您可以这样设置其名称:

A kset also has a name, which is stored in the embedded kobject. So, if you have a kset called my_set, you would set its name with:

kobject_set_name(&my_set->kobj, "The name");

kset 还有一个指针(在 ktype 字段中),指向描述其所含 kobject 的 kobj_type 结构。该类型会优先于 kobject 自身的 ktype 字段被使用。因此,在典型用法中,struct kobject 中的 ktype 字段被保留为 NULL,因为实际使用的是 kset 中的同名字段。

Ksets also have a pointer (in the ktype field) to the kobj_type structure describing the kobjects it contains. This type is used in preference to the ktype field in a kobject itself. As a result, in typical usage, the ktype field in struct kobject is left NULL, because the same field within the kset is the one actually used.

最后,kset 包含子系统指针(称为 subsys)。现在是时候讨论子系统了。

Finally, a kset contains a subsystem pointer (called subsys). So it's time to talk about subsystems.

子系统

Subsystems

子系统是对内核整体中某个高级部分的表示。子系统通常(但并非总是)出现在 sysfs 层次结构的顶部。内核中的一些子系统示例包括 block_subsys(/sys/block,用于块设备)、devices_subsys(/sys/devices,核心设备层次结构),以及内核已知的每种总线类型各自的子系统。驱动程序作者几乎不需要创建新的子系统;如果您想这样做,请三思。您最终可能想要的,是添加一个新的类,如第 14.5 节中所述。

A subsystem is a representation for a high-level portion of the kernel as a whole. Subsystems usually (but not always) show up at the top of the sysfs hierarchy. Some example subsystems in the kernel include block_subsys (/sys/block, for block devices), devices_subsys (/sys/devices, the core device hierarchy), and a specific subsystem for every bus type known to the kernel. A driver author almost never needs to create a new subsystem; if you feel tempted to do so, think again. What you probably want, in the end, is to add a new class, as discussed in Section 14.5.

子系统由一个简单的结构表示:

A subsystem is represented by a simple structure:

struct subsystem {
    struct kset kset;
    struct rw_semaphore rwsem;
};

因此,子系统实际上只是一个 kset 的包装器,其中包含一个信号量。

A subsystem, thus, is really just a wrapper around a kset, with a semaphore thrown in.

每个 kset 都必须属于一个子系统。子系统成员资格有助于确定 kset 在层次结构中的位置,但更重要的是,子系统的 rwsem 信号量用于串行化对 kset 内部链表的访问。该成员资格由 struct kset 中的 subsys 指针表示。因此,可以从 kset 的结构中找到每个 kset 所属的子系统,但无法直接从子系统结构中找到该子系统包含的多个 kset。

Every kset must belong to a subsystem. The subsystem membership helps establish the kset's position in the hierarchy, but, more importantly, the subsystem's rwsem semaphore is used to serialize access to a kset's internal-linked list. This membership is represented by the subsys pointer in struct kset. Thus, one can find each kset's containing subsystem from the kset's structure, but one cannot find the multiple ksets contained in a subsystem directly from the subsystem structure.

子系统通常用特殊的宏来声明:

Subsystems are often declared with a special macro:

decl_subsys(name, struct kobj_type *type, 
            struct kset_hotplug_ops *hotplug_ops);

该宏创建一个 struct subsystem,其名称由传给宏的 name 加上 _subsys 后缀构成。该宏还使用给定的 type 和 hotplug_ops 初始化内部 kset。(我们将在本章后面讨论热插拔操作。)

This macro creates a struct subsystem with a name formed by taking the name given to the macro and appending _subsys to it. The macro also initializes the internal kset with the given type and hotplug_ops. (We discuss hotplug operations later in this chapter.)
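The name pasting performed by decl_subsys relies on the C preprocessor's ## operator. A simplified user-space imitation of that mechanism (subsystem_like and decl_subsys_like are illustrative, not the kernel macro, which also fills in the ktype and hotplug operations):

```c
#include <assert.h>

/* A stand-in for struct subsystem, reduced to just a name. */
struct subsystem_like { const char *name; };

/* Paste _subsys onto the given name with ##, and stringize the
 * name with # so it can be stored in the structure. */
#define decl_subsys_like(_name) \
    struct subsystem_like _name##_subsys = { #_name }

decl_subsys_like(block);   /* defines a variable named block_subsys */
```

After this declaration, code can refer to block_subsys directly, which is how names such as block_subsys and devices_subsys come into existence in the kernel.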

子系统具有通常的设置和拆卸功能列表:

Subsystems have the usual list of setup and teardown functions:

void subsystem_init(struct subsystem *subsys);
int subsystem_register(struct subsystem *subsys);
void subsystem_unregister(struct subsystem *subsys);
struct subsystem *subsys_get(struct subsystem *subsys);
void subsys_put(struct subsystem *subsys);

这些操作大多只是作用于子系统的 kset。

Most of these operations just act upon the subsystem's kset.

低级 Sysfs 操作

Low-Level Sysfs Operations

Kobject 是 sysfs 虚拟文件系统背后的机制。对于 sysfs 中的每个目录,内核中的某处都潜伏着一个 kobject。每个有趣的 kobject 还导出一个或多个属性,这些属性作为包含内核生成信息的文件出现在该 kobject 的 sysfs 目录中。本节研究 kobject 和 sysfs 如何在低级别上交互。

Kobjects are the mechanism behind the sysfs virtual filesystem. For every directory found in sysfs, there is a kobject lurking somewhere within the kernel. Every kobject of interest also exports one or more attributes, which appear in that kobject's sysfs directory as files containing kernel-generated information. This section examines how kobjects and sysfs interact at a low level.

使用 sysfs 的代码应包含<linux/sysfs.h>

Code that works with sysfs should include <linux/sysfs.h>.

让 kobject 显示在 sysfs 中只需调用 kobject_add即可。我们已经看到该函数是将 kobject 添加到 kset 的方法;在 sysfs 中创建条目也是其工作的一部分。关于 sysfs 条目的创建方式,有几件事值得了解:

Getting a kobject to show up in sysfs is simply a matter of calling kobject_add. We have already seen that function as the way to add a kobject to a kset; creating entries in sysfs is also part of its job. There are a couple of things worth knowing about how the sysfs entry is created:

  • Sysfs entries for kobjects are always directories, so a call to kobject_add results in the creation of a directory in sysfs. Usually that directory contains one or more attributes; we see how attributes are specified shortly.

  • The name assigned to the kobject (with kobject_set_name) is the name used for the sysfs directory. Thus, kobjects that appear in the same part of the sysfs hierarchy must have unique names. Names assigned to kobjects should also be reasonable file names: they cannot contain the slash character, and the use of white space is strongly discouraged.

  • The sysfs entry is located in the directory corresponding to the kobject's parent pointer. If parent is NULL when kobject_add is called, it is set to the kobject embedded in the new kobject's kset; thus, the sysfs hierarchy usually matches the internal hierarchy created with ksets. If both parent and kset are NULL, the sysfs directory is created at the top level, which is almost certainly not what you want.

Using the mechanisms we have described so far, we can use a kobject to create an empty directory in sysfs. Usually, you want to do something a little more interesting than that, so it is time to look at the implementation of attributes.

Default Attributes

When created, every kobject is given a set of default attributes. These attributes are specified by way of the kobj_type structure. That structure, remember, looks like this:

struct kobj_type {
    void (*release)(struct kobject *);
    struct sysfs_ops *sysfs_ops;
    struct attribute **default_attrs;
};

The default_attrs field lists the attributes to be created for every kobject of this type, and sysfs_ops provides the methods to implement those attributes. We start with default_attrs, which points to an array of pointers to attribute structures:

struct attribute {
    char *name;
    struct module *owner;
    mode_t mode;
};

In this structure, name is the name of the attribute (as it appears within the kobject's sysfs directory), owner is a pointer to the module (if any) that is responsible for the implementation of this attribute, and mode is the protection bits that are to be applied to this attribute. The mode is usually S_IRUGO for read-only attributes; if the attribute is writable, you can toss in S_IWUSR to give write access to root only (the macros for modes are defined in <linux/stat.h>). The last entry in the default_attrs list must be zero-filled.
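
The pieces above can be put together in a small sketch. Everything here is hypothetical (the ex_* names are invented for illustration, and ex_sysfs_ops and ex_release are assumed to be defined elsewhere); it simply shows the shape of a kobj_type with one read-only default attribute and a NULL-terminated attribute list:

```c
/* Hypothetical example; the ex_* names are not from the kernel. */
static void ex_release(struct kobject *kobj);   /* defined elsewhere */
static struct sysfs_ops ex_sysfs_ops;           /* show/store methods */

static struct attribute ex_attr = {
    .name = "example",   /* file name in the kobject's sysfs directory */
    .mode = S_IRUGO,     /* read-only for everybody */
};

static struct attribute *ex_default_attrs[] = {
    &ex_attr,
    NULL,                /* the array must be terminated with NULL */
};

static struct kobj_type ex_ktype = {
    .release       = ex_release,
    .sysfs_ops     = &ex_sysfs_ops,
    .default_attrs = ex_default_attrs,
};
```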

The default_attrs array says what the attributes are but does not tell sysfs how to actually implement those attributes. That task falls to the kobj_type->sysfs_ops field, which points to a structure defined as:

struct sysfs_ops {
    ssize_t (*show)(struct kobject *kobj, struct attribute *attr, 
                    char *buffer);
    ssize_t (*store)(struct kobject *kobj, struct attribute *attr, 
                     const char *buffer, size_t size);
};

Whenever an attribute is read from user space, the show method is called with a pointer to the kobject and the appropriate attribute structure. That method should encode the value of the given attribute into buffer, being sure not to overrun it (it is PAGE_SIZE bytes), and return the actual length of the returned data. The conventions for sysfs state that each attribute should contain a single, human-readable value; if you have a lot of information to return, you may want to consider splitting it into multiple attributes.

The same show method is used for all attributes associated with a given kobject. The attr pointer passed into the function can be used to determine which attribute is being requested. Some show methods include a series of tests on the attribute name. Other implementations embed the attribute structure within another structure that contains the information needed to return the attribute's value; in this case, container_of may be used within the show method to obtain a pointer to the embedding structure.

The store method is similar; it should decode the data stored in buffer (size contains the length of that data, which does not exceed PAGE_SIZE), store and respond to the new value in whatever way makes sense, and return the number of bytes actually decoded. The store method can be called only if the attribute's permissions allow writes. When writing a store method, never forget that you are receiving arbitrary information from user space; you should validate it very carefully before taking any action in response. If the incoming data does not match expectations, return a negative error value rather than possibly doing something unwanted and unrecoverable. If your device exports a self_destruct attribute, you should require that a specific string be written there to invoke that functionality; an accidental, random write should yield only an error.
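
As a concrete (and entirely hypothetical) sketch of these conventions, the following pair exposes a single integer value; note the snprintf bound of PAGE_SIZE in the show method and the careful validation of user-supplied data in store:

```c
/* Hypothetical show/store pair; all ex_* names are invented. */
static int ex_value;     /* the state exposed through the attribute */

static ssize_t ex_show(struct kobject *kobj, struct attribute *attr,
                       char *buffer)
{
    /* A single, human-readable value, never exceeding PAGE_SIZE. */
    return snprintf(buffer, PAGE_SIZE, "%d\n", ex_value);
}

static ssize_t ex_store(struct kobject *kobj, struct attribute *attr,
                        const char *buffer, size_t size)
{
    char *end;
    long new_value = simple_strtol(buffer, &end, 0);

    if (end == buffer)   /* no digits at all: reject rather than guess */
        return -EINVAL;
    ex_value = new_value;
    return size;         /* report the whole buffer as consumed */
}

static struct sysfs_ops ex_sysfs_ops = {
    .show  = ex_show,
    .store = ex_store,
};
```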

Nondefault Attributes

In many cases, the kobject type's default_attrs field describes all the attributes that kobject will ever have. But that's not a restriction in the design; attributes can be added and removed to kobjects at will. If you wish to add a new attribute to a kobject's sysfs directory, simply fill in an attribute structure and pass it to:

int sysfs_create_file(struct kobject *kobj, struct attribute *attr);

If all goes well, the file is created with the name given in the attribute structure, and the return value is 0; otherwise, the usual negative error code is returned.
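
A minimal usage sketch (the attribute and function names here are invented for illustration): adding a root-writable "reset" attribute to an existing kobject at runtime might look like:

```c
/* Hypothetical runtime attribute; the ex_* names are illustrative. */
static struct attribute ex_reset_attr = {
    .name = "reset",
    .mode = S_IWUSR,     /* writable by root only */
};

static int ex_add_reset(struct kobject *kobj)
{
    /* Creates <kobject dir>/reset; returns 0 or a negative errno. */
    return sysfs_create_file(kobj, &ex_reset_attr);
}
```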

Note that the same show( ) and store( ) functions are called to implement operations on the new attribute. Before you add a new, nondefault attribute to a kobject, you should take whatever steps are necessary to ensure that those functions know how to implement that attribute.

To remove an attribute, call:

int sysfs_remove_file(struct kobject *kobj, struct attribute *attr);

After the call, the attribute no longer appears in the kobject's sysfs entry. Do be aware, however, that a user-space process could have an open file descriptor for that attribute and that show and store calls are still possible after the attribute has been removed.

Binary Attributes

The sysfs conventions call for all attributes to contain a single value in a human-readable text format. That said, there is an occasional, rare need for the creation of attributes that can handle larger chunks of binary data. That need really only comes about when data must be passed, untouched, between user space and the device. For example, uploading firmware to devices requires this feature. When such a device is encountered in the system, a user-space program can be started (via the hotplug mechanism); that program then passes the firmware code to the kernel via a binary sysfs attribute, as is shown in Section 14.8.1.

Binary attributes are described with a bin_attribute structure:

struct bin_attribute {
    struct attribute attr;
    size_t size;
    ssize_t (*read)(struct kobject *kobj, char *buffer, 
                    loff_t pos, size_t size);
    ssize_t (*write)(struct kobject *kobj, char *buffer, 
                    loff_t pos, size_t size);
};

这里,attrattribute给出二进制属性的名称、所有者和权限的结构,并且size是二进制属性的最大大小(或者0如果没有最大值)。读取 和写入方法工作方式与普通的字符驱动程序类似;对于单次加载,可以多次调用它们,每次调用最多加载一页数据。sysfs 无法发出一组写操作的最后一个信号,因此实现二进制属性的代码必须能够以其他方式确定数据的结尾。

Here, attr is an attribute structure giving the name, owner, and permissions for the binary attribute, and size is the maximum size of the binary attribute (or 0 if there is no maximum). The read and write methods work similarly to the normal char driver equivalents; they can be called multiple times for a single load with a maximum of one page worth of data in each call. There is no way for sysfs to signal the last of a set of write operations, so code implementing a binary attribute must be able to determine the end of the data some other way.
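
As a sketch of how a firmware-style binary attribute might be structured (all names and the fixed-size end-of-data convention here are assumptions for illustration, not the kernel's actual firmware interface):

```c
/* Hypothetical firmware sink; ex_* names and sizes are invented. */
#define EX_FW_MAX_SIZE 4096
static char ex_fw_image[EX_FW_MAX_SIZE];

static ssize_t ex_fw_write(struct kobject *kobj, char *buffer,
                           loff_t pos, size_t size)
{
    /* Called once per chunk (at most a page); pos locates the chunk.
     * End-of-data is inferred here from the fixed maximum size. */
    if (pos + size > EX_FW_MAX_SIZE)
        return -ENOSPC;
    memcpy(ex_fw_image + pos, buffer, size);
    return size;
}

static struct bin_attribute ex_fw_attr = {
    .attr  = { .name = "firmware", .mode = S_IWUSR },
    .size  = EX_FW_MAX_SIZE,
    .write = ex_fw_write,
};
```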

Binary attributes must be created explicitly; they cannot be set up as default attributes. To create a binary attribute, call:

int sysfs_create_bin_file(struct kobject *kobj, 
                          struct bin_attribute *attr);

Binary attributes can be removed with:

int sysfs_remove_bin_file(struct kobject *kobj, 
                          struct bin_attribute *attr);

Symbolic Links

The sysfs filesystem has the usual tree structure, reflecting the hierarchical organization of the kobjects it represents. The relationships between objects in the kernel are often more complicated than that, however. For example, one sysfs subtree (/sys/devices) represents all of the devices known to the system, while other subtrees (under /sys/bus) represent the device drivers. These trees do not, however, represent the relationships between the drivers and the devices they manage. Showing these additional relationships requires extra pointers which, in sysfs, are implemented through symbolic links.

Creating a symbolic link within sysfs is easy:

int sysfs_create_link(struct kobject *kobj, struct kobject *target,
                      char *name);

This function creates a link (called name) pointing to target's sysfs entry as an attribute of kobj. It is a relative link, so it works regardless of where sysfs is mounted on any particular system.

The link persists even if target is removed from the system. If you are creating symbolic links to other kobjects, you should probably have a way of knowing about changes to those kobjects, or some sort of assurance that the target kobjects will not disappear. The consequences (dead symbolic links within sysfs) are not particularly grave, but they are not representative of the best programming style and can cause confusion in user space.
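
A short sketch (with invented names) of the create/remove pairing, linking a driver's sysfs directory to the device it manages:

```c
/* Hypothetical helpers: <driver dir>/device -> <device dir>. */
static int ex_make_link(struct kobject *drv_kobj,
                        struct kobject *dev_kobj)
{
    /* Returns 0 on success or a negative error code. */
    return sysfs_create_link(drv_kobj, dev_kobj, "device");
}

static void ex_drop_link(struct kobject *drv_kobj)
{
    /* Remove the link before the target kobject can go away. */
    sysfs_remove_link(drv_kobj, "device");
}
```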

Symbolic links can be removed with:

void sysfs_remove_link(struct kobject *kobj, char *name);

Hotplug Event Generation

A hotplug event is a notification to user space from the kernel that something has changed in the system's configuration. They are generated whenever a kobject is created or destroyed. Such events are generated, for example, when a digital camera is plugged in with a USB cable, when a user switches console modes, or when a disk is repartitioned. Hotplug events turn into an invocation of /sbin/hotplug, which can respond to each event by loading drivers, creating device nodes, mounting partitions, or taking any other action that is appropriate.

The last major kobject function we look at is the generation of these events. The actual event generation takes place when a kobject is passed to kobject_add or kobject_del. Before the event is handed to user space, code associated with the kobject (or, more specifically, the kset to which it belongs) has the opportunity to add information for user space or to disable event generation entirely.

Hotplug Operations

Actual control of hotplug events is exercised by way of a set of methods stored in the kset_hotplug_ops structure:

struct kset_hotplug_ops {
    int (*filter)(struct kset *kset, struct kobject *kobj);
    char *(*name)(struct kset *kset, struct kobject *kobj);
    int (*hotplug)(struct kset *kset, struct kobject *kobj, 
                   char **envp, int num_envp, char *buffer, 
                   int buffer_size);
};

A pointer to this structure is found in the hotplug_ops field of the kset structure. If a given kobject is not contained within a kset, the kernel searches up through the hierarchy (via the parent pointer) until it finds a kobject that does have a kset; that kset's hotplug operations are then used.

The filter hotplug operation is called whenever the kernel is considering generating an event for a given kobject. If filter returns 0, the event is not created. This method, therefore, gives the kset code an opportunity to decide which events should be passed on to user space and which should not.

As an example of how this method might be used, consider the block subsystem. There are at least three types of kobjects used there, representing disks, partitions, and request queues. User space may want to react to the addition of a disk or a partition, but it does not normally care about request queues. So the filter method allows event generation only for kobjects representing disks and partitions. It looks like this:

static int block_hotplug_filter(struct kset *kset, struct kobject *kobj)
{
    struct kobj_type *ktype = get_ktype(kobj);

    return ((ktype == &ktype_block) || (ktype == &ktype_part));
}

Here, a quick test on the type of kobject is sufficient to decide whether the event should be generated or not.

When the user-space hotplug program is invoked, it is passed the name of the relevant subsystem as its one and only parameter. The name hotplug method is charged with providing that name. It should return a simple string suitable for passing to user space.
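
For a bus such as the lddbus example used in this chapter, a name method could be as simple as the following sketch (the function name is hypothetical):

```c
/* Hypothetical name method: every event from this kset is reported
 * to /sbin/hotplug under the subsystem name "ldd". */
static char *ex_hotplug_name(struct kset *kset, struct kobject *kobj)
{
    return "ldd";
}
```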

Everything else that the hotplug script might want to know is passed in the environment. The final hotplug method (hotplug) gives an opportunity to add useful environment variables prior to the invocation of that script. Again, this method's prototype is:

int (*hotplug)(struct kset *kset, struct kobject *kobj, 
               char **envp, int num_envp, char *buffer, 
               int buffer_size);

As usual, kset and kobject describe the object for which the event is being generated. The envp array is a place to store additional environment variable definitions (in the usual NAME=value format); it has num_envp entries available. The variables themselves should be encoded into buffer, which is buffer_size bytes long. If you add any variables to envp, be sure to add a NULL entry after your last addition so that the kernel knows where the end is. The return value should normally be 0; any nonzero return aborts the generation of the hotplug event.

The generation of hotplug events (like much of the work in the device model) is usually handled by logic at the bus driver level.

Buses, Devices, and Drivers

So far, we have seen a great deal of low-level infrastructure and a relative shortage of examples. We try to make up for that in the rest of this chapter as we get into the higher levels of the Linux device model. To that end, we introduce a new virtual bus, which we call lddbus,[1] and modify the scullp driver to "connect" to that bus.

Once again, much of the material covered here will never be needed by many driver authors. Details at this level are generally handled at the bus level, and few authors need to add a new bus type. This information is useful, however, for anybody wondering what is happening inside the PCI, USB, etc. layers or who needs to make changes at that level.

Buses

A bus is a channel between the processor and one or more devices. For the purposes of the device model, all devices are connected via a bus, even if it is an internal, virtual, "platform" bus. Buses can plug into each other—a USB controller is usually a PCI device, for example. The device model represents the actual connections between buses and the devices they control.

In the Linux device model, a bus is represented by the bus_type structure, defined in <linux/device.h>. This structure looks like:

struct bus_type {
    char *name;
    struct subsystem subsys;
    struct kset drivers;
    struct kset devices;
    int (*match)(struct device *dev, struct device_driver *drv);
    struct device *(*add)(struct device * parent, char * bus_id);
    int (*hotplug) (struct device *dev, char **envp, 
                    int num_envp, char *buffer, int buffer_size);
    /* Some fields omitted */
};

The name field is the name of the bus, something such as pci. You can see from the structure that each bus is its own subsystem; these subsystems do not live at the top level in sysfs, however. Instead, they are found underneath the bus subsystem. A bus contains two ksets, representing the known drivers for that bus and all devices plugged into the bus. Then, there is a set of methods that we will get to shortly.

Bus registration

As we mentioned, the example source includes a virtual bus implementation called lddbus. This bus sets up its bus_type structure as follows:

struct bus_type ldd_bus_type = {
    .name = "ldd",
    .match = ldd_match,
    .hotplug  = ldd_hotplug,
};

Note that very few of the bus_type fields require initialization; most of that is handled by the device model core. We do have to specify the name of the bus, however, and any methods that go along with it.

Inevitably, a new bus must be registered with the system via a call to bus_register. The lddbus code does so in this way:

ret = bus_register(&ldd_bus_type);
if (ret)
    return ret;

This call can fail, of course, so the return value must always be checked. If it succeeds, the new bus subsystem has been added to the system; it is visible in sysfs under /sys/bus, and it is possible to start adding devices.

Should it be necessary to remove a bus from the system (when the associated module is removed, for example), bus_unregister should be called:

void bus_unregister(struct bus_type *bus);

Bus methods

There are several methods defined for the bus_type structure; they allow the bus code to serve as an intermediary between the device core and individual drivers. The methods defined in the 2.6.10 kernel are:

int (*match)(struct device *device, struct device_driver *driver);

This method is called, perhaps multiple times, whenever a new device or driver is added for this bus. It should return a nonzero value if the given device can be handled by the given driver. (We get to the details of the device and device_driver structures shortly). This function must be handled at the bus level, because that is where the proper logic exists; the core kernel cannot know how to match devices and drivers for every possible bus type.

int (*hotplug) (struct device *device, char **envp, int num_envp,
                char *buffer, int buffer_size);

This method allows the bus to add variables to the environment prior to the generation of a hotplug event in user space. The parameters are the same as for the kset hotplug method (described in the earlier Section 14.3).

The lddbus driver has a very simple match function, which simply compares the driver and device names:

static int ldd_match(struct device *dev, struct device_driver *driver)
{
    return !strncmp(dev->bus_id, driver->name, strlen(driver->name));
}

When real hardware is involved, the match function usually makes some sort of comparison between the hardware ID provided by the device itself and the IDs supported by the driver.

The lddbus hotplug method looks like this:

static int ldd_hotplug(struct device *dev, char **envp, int num_envp,
        char *buffer, int buffer_size)
{
    envp[0] = buffer;
    if (snprintf(buffer, buffer_size, "LDDBUS_VERSION=%s",
                Version) >= buffer_size)
        return -ENOMEM;
    envp[1] = NULL;
    return 0;
}

Here, we add in the current revision number of the lddbus source, just in case anybody is curious.

Iterating over devices and drivers

If you are writing bus-level code, you may find yourself having to perform some operation on all devices or drivers that have been registered with your bus. It may be tempting to dig directly into the structures in the bus_type structure, but it is better to use the helper functions that have been provided.

To operate on every device known to the bus, use:

int bus_for_each_dev(struct bus_type *bus, struct device *start, 
                     void *data, int (*fn)(struct device *, void *));

This function iterates over every device on bus, passing the associated device structure to fn, along with the value passed in as data. If start is NULL, the iteration begins with the first device on the bus; otherwise iteration starts with the first device after start. If fn returns a nonzero value, iteration stops and that value is returned from bus_for_each_dev.
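
A brief usage sketch: counting the devices currently registered on the lddbus bus shown earlier (the ex_* names are invented for illustration):

```c
/* Illustrative only; the ex_* names are not from the kernel. */
static int ex_count_one(struct device *dev, void *data)
{
    ++*(int *) data;     /* tally this device */
    return 0;            /* a nonzero return would stop the walk */
}

static int ex_count_devices(void)
{
    int count = 0;
    bus_for_each_dev(&ldd_bus_type, NULL, &count, ex_count_one);
    return count;
}
```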

There is a similar function for iterating over drivers:

int bus_for_each_drv(struct bus_type *bus, struct device_driver *start, 
                     void *data, int (*fn)(struct device_driver *, void *));

This function works just like bus_for_each_dev, except, of course, that it works with drivers instead.

It should be noted that both of these functions hold the bus subsystem's reader/writer semaphore for the duration of the work. So an attempt to use the two of them together will deadlock—each will be trying to obtain the same semaphore. Operations that modify the bus (such as unregistering devices) will also lock up. So, use the bus_for_each functions with some care.

Bus attributes

Almost every layer in the Linux device model provides an interface for the addition of attributes, and the bus layer is no exception. The bus_attribute type is defined in <linux/device.h> as follows:

struct bus_attribute {
    struct attribute attr;
    ssize_t (*show)(struct bus_type *bus, char *buf);
    ssize_t (*store)(struct bus_type *bus, const char *buf, 
                     size_t count);
};

We have already seen struct attribute in Section 14.2.1. The bus_attribute type also includes two methods for displaying and setting the value of the attribute. Most device model layers above the kobject level work this way.

A convenience macro has been provided for the compile-time creation and initialization of bus_attribute structures:

BUS_ATTR(name, mode, show, store);

This macro declares a structure, generating its name by prepending the string bus_attr_ to the given name.

Any attributes belonging to a bus should be created explicitly with bus_create_file:

int bus_create_file(struct bus_type *bus, struct bus_attribute *attr);

Attributes can also be removed with:

void bus_remove_file(struct bus_type *bus, struct bus_attribute *attr);

The lddbus driver creates a simple attribute file containing, once again, the source version number. The show method and bus_attribute structure are set up as follows:

static ssize_t show_bus_version(struct bus_type *bus, char *buf)
{
    return snprintf(buf, PAGE_SIZE, "%s\n", Version);
}

static BUS_ATTR(version, S_IRUGO, show_bus_version, NULL);

Creating the attribute file is done at module load time:

if (bus_create_file(&ldd_bus_type, &bus_attr_version))
    printk(KERN_NOTICE "Unable to create version attribute\n");

This call creates an attribute file (/sys/bus/ldd/version) containing the revision number for the lddbus code.

设备

Devices

在最低级别,Linux 系统中的每个设备都由一个 struct device 实例表示:

At the lowest level, every device in a Linux system is represented by an instance of struct device:

struct device {
    struct device *parent;
    struct kobject kobj;
    char bus_id[BUS_ID_SIZE];
    struct bus_type *bus;
    struct device_driver *driver;
    void *driver_data;
    void (*release)(struct device *dev);
    /* Several fields omitted */
};

struct device 还有许多其他字段,只有设备核心代码才会关心。不过,下面这些字段值得了解:

There are many other struct device fields that are of interest only to the device core code. These fields, however, are worth knowing about:

struct device *parent
struct device *parent

该设备的“父”设备,即它所连接的设备。在大多数情况下,父设备是某种总线或主机控制器。如果 parent 为 NULL,则该设备是顶级设备,但这通常不是您想要的情况。

The device's "parent" device—the device to which it is attached. In most cases, a parent device is some sort of bus or host controller. If parent is NULL, the device is a top-level device, which is not usually what you want.

struct kobject kobj;
struct kobject kobj;

代表该设备并将其链接到层次结构中的 kobject。请注意,作为一般规则,device->kobj->parent 等于 &device->parent->kobj。

The kobject that represents this device and links it into the hierarchy. Note that, as a general rule, device->kobj->parent is equal to &device->parent->kobj.

char bus_id[BUS_ID_SIZE];
char bus_id[BUS_ID_SIZE];

总线上唯一标识该设备的字符串。例如,PCI 设备使用标准 PCI ID 格式,其中包含域、总线、设备和功能号。

A string that uniquely identifies this device on the bus. PCI devices, for example, use the standard PCI ID format containing the domain, bus, device, and function numbers.

struct bus_type *bus;
struct bus_type *bus;

识别设备所在的总线类型。

Identifies which kind of bus the device sits on.

struct device_driver *driver;
struct device_driver *driver;

管理该设备的驱动程序;我们将在下一节研究 struct device_driver。

The driver that manages this device; we examine struct device_driver in the next section.

void *driver_data;
void *driver_data;

设备驱动程序可以使用的私有数据字段。

A private data field that may be used by the device driver.

void (*release)(struct device *dev);
void (*release)(struct device *dev);

当对设备的最后一个引用被删除时调用该方法;它是从内嵌 kobject 的 release 方法中调用的。所有向内核注册的 device 结构都必须有 release 方法,否则内核会打印出可怕的抱怨。

The method is called when the last reference to the device is removed; it is called from the embedded kobject's release method. All device structures registered with the core must have a release method, or the kernel prints out scary complaints.

在注册设备结构之前,至少必须设置 parent、bus_id、bus 和 release 字段。

At a minimum, the parent, bus_id, bus, and release fields must be set before the device structure can be registered.

设备注册

Device registration

通常的注册和注销函数都存在:

The usual set of registration and unregistration functions exists:

int device_register(struct device *dev);
void device_unregister(struct device *dev);

我们已经了解了lddbus代码如何注册其总线类型。然而,实际的总线是一个设备,必须单独注册。为简单起见,lddbus模块仅支持单个虚拟总线,因此驱动程序在编译时设置其设备:

We have seen how the lddbus code registers its bus type. However, an actual bus is a device and must be registered separately. For simplicity, the lddbus module supports only a single virtual bus, so the driver sets up its device at compile time:

static void ldd_bus_release(struct device *dev)
{
    printk(KERN_DEBUG "lddbus release\n");
}
    
struct device ldd_bus = {
    .bus_id   = "ldd0",
    .release  = ldd_bus_release
};

这是顶级总线,因此 parent 和 bus 字段保留为 NULL。我们使用了一个简单的、无操作的 release 方法;作为第一条(也是唯一一条)总线,它的名称是 ldd0。该总线设备用以下代码注册:

This is a top-level bus, so the parent and bus fields are left NULL. We have a simple, no-op release method, and, as the first (and only) bus, its name is ldd0. This bus device is registered with:

ret = device_register(&ldd_bus);
if (ret)
    printk(KERN_NOTICE "Unable to register ldd0\n");

调用完成后,可以在 sysfs 中的/sys/devices下看到新总线。添加到该总线的任何设备都会显示在 /sys/devices/ldd0/下。

Once that call is complete, the new bus can be seen under /sys/devices in sysfs. Any devices added to this bus then show up under /sys/devices/ldd0/.

设备属性

Device attributes

设备条目 在 sysfs 中可以有属性。相关结构为:

Device entries in sysfs can have attributes. The relevant structure is:

struct device_attribute {
    struct attribute attr;
    ssize_t (*show)(struct device *dev, char *buf);
    ssize_t (*store)(struct device *dev, const char *buf, 
                     size_t count);
};

这些属性结构可以在编译时使用以下宏设置:

These attribute structures can be set up at compile time with this macro:

DEVICE_ATTR(name, mode, show, store);

生成的结构的名称是在给定的 name 前加上 dev_attr_ 得到的。属性文件的实际管理通过常见的一对函数处理:

The resulting structure is named by prepending dev_attr_ to the given name. The actual management of attribute files is handled with the usual pair of functions:

int device_create_file(struct device *device, 
                       struct device_attribute *entry);
void device_remove_file(struct device *dev, 
                        struct device_attribute *attr);

struct bus_type 的 dev_attrs 字段指向一个默认属性列表,这些属性会为添加到该总线的每个设备创建。

The dev_attrs field of struct bus_type points to a list of default attributes created for every device added to that bus.

设备结构嵌入

Device structure embedding

device 结构包含设备模型核心为系统建模所需的信息。然而,大多数子系统还会跟踪其所管理设备的附加信息。因此,很少有设备用裸的 device 结构表示;相反,与 kobject 结构一样,该结构通常内嵌在设备的更高层表示中。如果查看 struct pci_dev 或 struct usb_device 的定义,就会发现里面埋着一个 struct device。通常,低级驱动程序甚至不知道那个 struct device 的存在,但也可能有例外。

The device structure contains the information that the device model core needs to model the system. Most subsystems, however, track additional information about the devices they host. As a result, it is rare for devices to be represented by bare device structures; instead, that structure, like kobject structures, is usually embedded within a higher-level representation of the device. If you look at the definitions of struct pci_dev or struct usb_device, you will find a struct device buried inside. Usually, low-level drivers are not even aware of that struct device, but there can be exceptions.

lddbus驱动程序创建自己的设备类型 ( struct ldd_device) 并期望各个设备驱动程序使用该类型注册其设备。这是一个简单的结构:

The lddbus driver creates its own device type (struct ldd_device) and expects individual device drivers to register their devices using that type. It is a simple structure:

struct ldd_device {
    char *name;
    struct ldd_driver *driver;
    struct device dev;
};

#define to_ldd_device(dev) container_of(dev, struct ldd_device, dev);

此结构允许驱动程序为设备提供一个实际名称(它可以不同于存储在 device 结构中的总线 ID),以及一个指向驱动程序信息的指针。真实设备的结构通常还包含有关供应商、设备型号、设备配置、所用资源等的信息。好的例子可以在 struct pci_dev(<linux/pci.h>)或 struct usb_device(<linux/usb.h>)中找到。还为 struct ldd_device 定义了一个方便的宏(to_ldd_device),以便把指向内嵌 device 结构的指针轻松转换为 ldd_device 指针。

This structure allows the driver to provide an actual name for the device (which can be distinct from its bus ID, stored in the device structure) and a pointer to driver information. Structures for real devices usually also contain information about the vendor, device model, device configuration, resources used, and so on. Good examples can be found in struct pci_dev (<linux/pci.h>) or struct usb_device (<linux/usb.h>). A convenience macro (to_ldd_device) is also defined for struct ldd_device to make it easy to turn pointers to the embedded device structure into ldd_device pointers.
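The to_ldd_device macro relies on container_of, which recovers a pointer to the enclosing structure from a pointer to one of its members by subtracting the member's offset. The following standalone user-space sketch illustrates the same pointer arithmetic; the *_sim structures and the container_of_sim macro are simplified stand-ins invented for this illustration, not the kernel's actual definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; names are illustrative only. */
struct device_sim {
    int dummy;
};

struct ldd_device_sim {
    const char *name;
    struct device_sim dev;   /* plays the role of the embedded struct device */
};

/* A user-space rendition of the kernel's container_of idiom:
 * step back from the member's address by its offset within the
 * containing type to find the start of the outer structure. */
#define container_of_sim(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* The equivalent of to_ldd_device: recover the outer ldd_device_sim
 * from a pointer to its embedded device_sim. */
static struct ldd_device_sim *to_ldd_device_sim(struct device_sim *dev)
{
    return container_of_sim(dev, struct ldd_device_sim, dev);
}
```

Given only the &ldd->dev pointer that the driver core passes around, the driver can thus get back to its own ldd_device_sim, which is exactly how the embedded-structure pattern works throughout the device model.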

lddbus导出的注册接口如下所示:

The registration interface exported by lddbus looks like this:

int register_ldd_device(struct ldd_device *ldddev)
{
    ldddev->dev.bus = &ldd_bus_type;
    ldddev->dev.parent = &ldd_bus;
    ldddev->dev.release = ldd_dev_release;
    strncpy(ldddev->dev.bus_id, ldddev->name, BUS_ID_SIZE);
    return device_register(&ldddev->dev);
}
EXPORT_SYMBOL(register_ldd_device);

在这里,我们只需填写一些嵌入的device结构字段(各个驱动程序不需要知道这些字段),并向驱动程序核心注册设备。如果我们想向设备添加特定于总线的属性,我们可以在此处执行此操作。

Here, we simply fill in some of the embedded device structure fields (which individual drivers should not need to know about), and register the device with the driver core. If we wanted to add bus-specific attributes to the device, we could do so here.

为了展示如何使用此接口,让我们介绍另一个示例驱动程序,我们将其称为sculld它是第 8 章中首次介绍的scullp驱动程序的另一个变体 。它实现了通常的内存区域设备,但sculld还通过lddbus接口与 Linux 设备模型配合使用。

To show how this interface is used, let us introduce another sample driver, which we have called sculld. It is yet another variant on the scullp driver first introduced in Chapter 8. It implements the usual memory area device, but sculld also works with the Linux device model by way of the lddbus interface.

sculld 驱动程序向其设备条目添加了一个自己的属性;该属性名为 dev,仅包含关联的设备号。模块加载脚本或热插拔子系统可以使用此属性,在设备添加到系统时自动创建设备节点。该属性的设置遵循常见的模式:

The sculld driver adds an attribute of its own to its device entry; this attribute, called dev, simply contains the associated device number. This attribute could be used by module loading scripts or the hotplug subsystem to automatically create device nodes when the device is added to the system. The setup for this attribute follows the usual patterns:

static ssize_t sculld_show_dev(struct device *ddev, char *buf)
{
    struct sculld_dev *dev = ddev->driver_data;

    return print_dev_t(buf, dev->cdev.dev);
}

static DEVICE_ATTR(dev, S_IRUGO, sculld_show_dev, NULL);

然后,在初始化时,通过以下函数注册设备并创建 dev 属性:

Then, at initialization time, the device is registered, and the dev attribute is created through the following function:

static void sculld_register_dev(struct sculld_dev *dev, int index)
{
    sprintf(dev->devname, "sculld%d", index);
    dev->ldev.name = dev->devname;
    dev->ldev.driver = &sculld_driver;
    dev->ldev.dev.driver_data = dev;
    register_ldd_device(&dev->ldev);
    device_create_file(&dev->ldev.dev, &dev_attr_dev);
}

请注意,我们使用该driver_data字段来存储指向我们自己的内部设备结构的指针。

Note that we make use of the driver_data field to store the pointer to our own, internal device structure.

设备驱动程序

Device Drivers

设备模型跟踪系统已知的所有驱动程序。进行这种跟踪的主要原因是让驱动程序核心能够将驱动程序与新设备匹配起来。然而,一旦驱动程序成为系统内的已知对象,许多其他事情就成为可能。例如,设备驱动程序可以导出独立于任何特定设备的信息和配置变量。

The device model tracks all of the drivers known to the system. The main reason for this tracking is to enable the driver core to match up drivers with new devices. Once drivers are known objects within the system, however, a number of other things become possible. Device drivers can export information and configuration variables that are independent of any specific device, for example.

驱动程序由以下结构定义:

Drivers are defined by the following structure:

struct device_driver {
    char *name;
    struct bus_type *bus;
    struct kobject kobj;
    struct list_head devices;
    int (*probe)(struct device *dev);
    int (*remove)(struct device *dev);
    void (*shutdown) (struct device *dev);
};

该结构的几个字段再次被省略了(完整情况请参阅 <linux/device.h>)。这里,name 是驱动程序的名称(它会显示在 sysfs 中);bus 是该驱动程序使用的总线类型;kobj 是不可避免的 kobject;devices 是当前绑定到该驱动程序的所有设备的列表;probe 是被调用来查询特定设备是否存在(以及该驱动程序能否与之配合工作)的函数;remove 在设备从系统中移除时被调用;shutdown 在关机时被调用,使设备停止工作。

Once again, several of the structure's fields have been omitted (see <linux/device.h> for the full story). Here, name is the name of the driver (it shows up in sysfs), bus is the type of bus this driver works with, kobj is the inevitable kobject, devices is a list of all devices currently bound to this driver, probe is a function called to query the existence of a specific device (and whether this driver can work with it), remove is called when the device is removed from the system, and shutdown is called at shutdown time to quiesce the device.

到现在,用于处理 device_driver 结构的函数形式应该看起来很熟悉了(因此我们很快地介绍它们)。注册函数是:

The form of the functions for working with device_driver structures should be looking familiar by now (so we cover them very quickly). The registration functions are:

int driver_register(struct device_driver *drv);
void driver_unregister(struct device_driver *drv);

通常的属性结构存在:

The usual attribute structure exists:

struct driver_attribute {
    struct attribute attr;
    ssize_t (*show)(struct device_driver *drv, char *buf);
    ssize_t (*store)(struct device_driver *drv, const char *buf, 
                     size_t count);
};
DRIVER_ATTR(name, mode, show, store);

属性文件以通常的方式创建:

And attribute files are created in the usual way:

int driver_create_file(struct device_driver *drv, 
                       struct driver_attribute *attr);
void driver_remove_file(struct device_driver *drv, 
                        struct driver_attribute *attr);

bus_type 结构包含一个字段(drv_attrs),它指向一组默认属性;这些属性会为与该总线关联的所有驱动程序创建。

The bus_type structure contains a field (drv_attrs) that points to a set of default attributes, which are created for all drivers associated with that bus.

驱动程序结构嵌入

Driver structure embedding

与大多数驱动程序核心结构的情况一样,device_driver 结构通常内嵌在更高层的、特定于总线的结构中。lddbus 子系统绝不会违背这一趋势,所以它定义了自己的 ldd_driver 结构:

As is the case with most driver core structures, the device_driver structure is usually found embedded within a higher-level, bus-specific structure. The lddbus subsystem would never go against such a trend, so it has defined its own ldd_driver structure:

struct ldd_driver {
    char *version;
    struct module *module;
    struct device_driver driver;
    struct driver_attribute version_attr;
};

#define to_ldd_driver(drv) container_of(drv, struct ldd_driver, driver);

在这里,我们要求每个驱动程序提供其当前的软件版本,并且 lddbus为其知道的每个驱动程序导出该版本字符串。总线特定的驱动程序注册函数是:

Here, we require each driver to provide its current software version, and lddbus exports that version string for every driver it knows about. The bus-specific driver registration function is:

int register_ldd_driver(struct ldd_driver *driver)
{
    int ret;
    
    driver->driver.bus = &ldd_bus_type;
    ret = driver_register(&driver->driver);
    if (ret)
        return ret;
    driver->version_attr.attr.name = "version";
    driver->version_attr.attr.owner = driver->module;
    driver->version_attr.attr.mode = S_IRUGO;
    driver->version_attr.show = show_version;
    driver->version_attr.store = NULL;
    return driver_create_file(&driver->driver, &driver->version_attr);
}

该函数的前半部分只是将低级的 device_driver 结构注册到核心;其余部分设置 version 属性。由于该属性是在运行时创建的,我们不能使用 DRIVER_ATTR 宏;相反,必须手动填写 driver_attribute 结构。请注意,我们将属性的所有者设置为驱动程序模块,而不是 lddbus 模块;其原因可以从该属性的 show 函数的实现中看出:

The first half of the function simply registers the low-level device_driver structure with the core; the rest sets up the version attribute. Since this attribute is created at runtime, we can't use the DRIVER_ATTR macro; instead, the driver_attribute structure must be filled in by hand. Note that we set the owner of the attribute to the driver module, rather than the lddbus module; the reason for this can be seen in the implementation of the show function for this attribute:

static ssize_t show_version(struct device_driver *driver, char *buf)
{
    struct ldd_driver *ldriver = to_ldd_driver(driver);

    sprintf(buf, "%s\n", ldriver->version);
    return strlen(buf);
}

人们可能会认为属性的所有者应该是 lddbus 模块,因为实现该属性的函数是在那里定义的。然而,该函数使用的是由驱动程序自身创建(并拥有)的 ldd_driver 结构。如果在用户空间进程尝试读取版本号时该结构消失了,事情可能会变得很糟糕。将驱动程序模块指定为属性的所有者,可以在用户空间保持属性文件打开期间防止该模块被卸载。由于每个驱动程序模块都会创建对 lddbus 模块的引用,我们可以确信 lddbus 不会在不合适的时间被卸载。

One might think that the attribute owner should be the lddbus module, since the function that implements the attribute is defined there. This function, however, is working with the ldd_driver structure created (and owned) by the driver itself. If that structure were to go away while a user-space process tried to read the version number, things could get messy. Designating the driver module as the owner of the attribute prevents the module from being unloaded, while user-space holds the attribute file open. Since each driver module creates a reference to the lddbus module, we can be sure that lddbus will not be unloaded at an inopportune time.

为了完整起见,sculld创建其ldd_driver结构如下:

For completeness, sculld creates its ldd_driver structure as follows:

static struct ldd_driver sculld_driver = {
    .version = "$Revision: 1.1 $",
    .module = THIS_MODULE,
    .driver = {
        .name = "sculld",
    },
};

只需简单调用 register_ldd_driver 即可将其添加到系统中。初始化完成后,就可以在 sysfs 中看到驱动程序信息:

A simple call to register_ldd_driver adds it to the system. Once initialization is complete, the driver information can be seen in sysfs:

$ tree /sys/bus/ldd/drivers
/sys/bus/ldd/drivers
`-- sculld
    |-- sculld0 -> ../../../../devices/ldd0/sculld0
    |-- sculld1 -> ../../../../devices/ldd0/sculld1
    |-- sculld2 -> ../../../../devices/ldd0/sculld2
    |-- sculld3 -> ../../../../devices/ldd0/sculld3
    `-- version

类

Classes

我们在本章中研究的最后一个设备模型概念是类(class)。类是设备的高层视图,它抽象掉了低层实现细节。驱动程序可能看到的是 SCSI 磁盘或 ATA 磁盘,但在类的层面上,它们都只是磁盘。类允许用户空间根据设备做什么来使用设备,而不是根据设备如何连接或如何工作。

The final device model concept we examine in this chapter is the class. A class is a higher-level view of a device that abstracts out low-level implementation details. Drivers may see a SCSI disk or an ATA disk, but, at the class level, they are all simply disks. Classes allow user space to work with devices based on what they do, rather than how they are connected or how they work.

几乎所有类都显示在/sys/class下的 sysfs 中。因此,例如, 无论接口类型如何,所有网络接口都可以在/sys/class/net下找到。输入设备可以在/sys/class/input中找到,串行设备可以在/sys/class/tty中找到。一个例外是块设备,由于历史原因,可以在/sys/block下找到它 。

Almost all classes show up in sysfs under /sys/class. Thus, for example, all network interfaces can be found under /sys/class/net, regardless of the type of interface. Input devices can be found in /sys/class/input, and serial devices are in /sys/class/tty. The one exception is block devices, which can be found under /sys/block for historical reasons.

类成员资格通常由高级代码处理,不需要驱动程序的显式支持。当sbull驱动程序(参见第 16 章)创建虚拟磁盘设备时,它会自动出现在/sys/block中。snull网络驱动程序 (参见第 17 章)不必为其在 /sys/class/net中表示的接口执行任何特殊操作。然而,有时驱动程序最终会直接处理类。

Class membership is usually handled by high-level code without the need for explicit support from drivers. When the sbull driver (see Chapter 16) creates a virtual disk device, it automatically appears in /sys/block. The snull network driver (see Chapter 17) does not have to do anything special for its interfaces to be represented in /sys/class/net. There will be times, however, when drivers end up dealing with classes directly.

在许多情况下,类子系统是将信息导出到用户空间的最佳方式。当子系统创建一个类时,它完全拥有该类,因此无需担心哪个模块拥有在那里找到的属性。花费很少的时间在 sysfs 中更面向硬件的部分中徘徊,就会意识到它对于直接浏览来说可能是一个不友好的地方。用户更乐意在/sys/class/some-widget中查找信息,而不是在/sys/devices/pci0000:00/0000:00:10.0/usb2/2-0:1.0下查找信息。

In many cases, the class subsystem is the best way of exporting information to user space. When a subsystem creates a class, it owns the class entirely, so there is no need to worry about which module owns the attributes found there. It also takes very little time wandering around in the more hardware-oriented parts of sysfs to realize that it can be an unfriendly place for direct browsing. Users more happily find information in /sys/class/some-widget than under, say, /sys/devices/pci0000:00/0000:00:10.0/usb2/2-0:1.0.

驱动程序核心导出两个不同的接口来管理类。class_simple 例程旨在尽可能容易地向系统添加新类;它们的主要用途通常是公开包含设备号的属性,以便自动创建设备节点。常规的类接口更复杂,但也提供更多功能。我们从简单的版本开始。

The driver core exports two distinct interfaces for managing classes. The class_simple routines are designed to make it as easy as possible to add new classes to the system; their main purpose, usually, is to expose attributes containing device numbers to enable the automatic creation of device nodes. The regular class interface is more complex but offers more features as well. We start with the simple version.

class_simple 接口

The class_simple Interface

class_simple 接口的目的是极其易用,以至于没有人有任何借口不至少导出一个包含设备分配号的属性。使用此接口只需几个函数调用,几乎没有 Linux 设备模型常见的样板代码。

The class_simple interface was intended to be so easy to use that nobody would have any excuse for not exporting, at a minimum, an attribute containing a device's assigned number. Using this interface is simply a matter of a couple of function calls, with little of the usual boilerplate associated with the Linux device model.

第一步是创建类本身。这是通过调用 class_simple_create来完成的:

The first step is to create the class itself. That is accomplished with a call to class_simple_create:

struct class_simple *class_simple_create(struct module *owner, char *name);

该函数创建一个具有给定 name 的类。当然,该操作可能失败,因此在继续之前应始终检查返回值(使用 IS_ERR,如第 11 章 1.8 节所述)。

This function creates a class with the given name. The operation can fail, of course, so the return value should always be checked (using IS_ERR, described in Section 1.8 in Chapter 11) before continuing.

一个简单的类可以通过以下方式销毁:

A simple class can be destroyed with:

void class_simple_destroy(struct class_simple *cs);

创建简单类的真正目的是向其中添加设备;该任务是通过以下方式实现的:

The real purpose of creating a simple class is to add devices to it; that task is achieved with:

struct class_device *class_simple_device_add(struct class_simple *cs,
                                             dev_t devnum,
                                             struct device *device,
                                             const char *fmt, ...);

这里,cs 是先前创建的简单类,devnum 是分配的设备号,device 是表示该设备的 struct device,其余参数是 printk 风格的格式字符串和用于生成设备名称的参数。此调用向类中添加一个条目,其中包含一个名为 dev 的属性,该属性保存设备号。如果 device 参数不是 NULL,则会创建一个符号链接(名为 device),指向 /sys/devices 下该设备的条目。

Here, cs is the previously created simple class, devnum is the assigned device number, device is the struct device representing this device, and the remaining parameters are a printk-style format string and arguments to create the device name. This call adds an entry to the class containing one attribute, dev, which holds the device number. If the device parameter is not NULL, a symbolic link (called device) points to the device's entry under /sys/devices.

可以向设备条目添加其他属性。这只是使用 class_device_create_file的问题,我们将在下一节中与完整类子系统的其余部分讨论它。

It is possible to add other attributes to a device entry. It is just a matter of using class_device_create_file, which we discuss in the next section with the rest of the full class subsystem.

当设备来来去去时,类会生成热插拔事件。如果您的驱动程序需要向用户空间事件处理程序的环境添加变量,它可以用以下函数设置热插拔回调:

Classes generate hotplug events when devices come and go. If your driver needs to add variables to the environment for the user-space event handler, it can set up a hotplug callback with:

int class_simple_set_hotplug(struct class_simple *cs, 
                             int (*hotplug)(struct class_device *dev, 
                                            char **envp, int num_envp, 
                                            char *buffer, int buffer_size));

当您的设备消失时,应使用以下函数删除类条目:

When your device goes away, the class entry should be removed with:

void class_simple_device_remove(dev_t dev);

注意,这里不需要 class_simple_device_add 返回的 class_device 结构;设备号(它当然应该是唯一的)就足够了。

Note that the class_device structure returned by class_simple_device_add is not needed here; the device number (which should certainly be unique) is sufficient.
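Putting these calls together, the lifetime of a simple class in a char driver's init and exit paths might look like the sketch below. This is kernel-space code for a 2.6.10-era kernel and cannot be built outside a kernel tree; the foo names are hypothetical, foo_devno is assumed to have been obtained earlier (for example with alloc_chrdev_region), and error handling is abbreviated:

```c
#include <linux/module.h>
#include <linux/device.h>
#include <linux/err.h>

static struct class_simple *foo_class;
static dev_t foo_devno;          /* assumed: assigned via alloc_chrdev_region() */

static int __init foo_init(void)
{
    /* Create /sys/class/foo; check the result with IS_ERR as described above. */
    foo_class = class_simple_create(THIS_MODULE, "foo");
    if (IS_ERR(foo_class))
        return PTR_ERR(foo_class);

    /* Create /sys/class/foo/foo0 with a "dev" attribute holding foo_devno.
     * Passing NULL as the struct device means no "device" symlink is made. */
    class_simple_device_add(foo_class, foo_devno, NULL, "foo0");
    return 0;
}

static void __exit foo_exit(void)
{
    /* Tear down in the reverse order: first the device entry, then the class. */
    class_simple_device_remove(foo_devno);
    class_simple_destroy(foo_class);
}
```

With this in place, a user-space tool watching hotplug events can read the dev attribute under /sys/class/foo/foo0 and create the matching device node.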

完整的类接口

The Full Class Interface

class_simple 接口足以满足许多需求,但有时需要更大的灵活性。下面的讨论描述了如何使用 class_simple 所基于的完整类机制。内容很简短:类函数和结构遵循与设备模型其余部分相同的模式,因此这里几乎没有真正的新内容。

The class_simple interface suffices for many needs, but sometimes more flexibility is required. The following discussion describes how to use the full class mechanism, upon which class_simple is based. It is brief: the class functions and structures follow the same patterns as the rest of the device model, so there is little that is truly new here.

管理类

Managing classes

类由 struct class 的一个实例定义:

A class is defined by an instance of struct class:

struct class {
    char *name;
    struct class_attribute *class_attrs;
    struct class_device_attribute *class_dev_attrs;
    int (*hotplug)(struct class_device *dev, char **envp, 
                   int num_envp, char *buffer, int buffer_size);
    void (*release)(struct class_device *dev);
    void (*class_release)(struct class *class);
    /* Some fields omitted */
};

每个类都需要一个唯一的 name,它决定了该类在 /sys/class 下如何出现。注册类时,class_attrs 指向的(以 NULL 结尾的)数组中列出的所有属性都会被创建。此外,每个添加到该类的设备都有一组默认属性,由 class_dev_attrs 指向。还有常见的 hotplug 函数,用于在生成事件时向环境添加变量。另外还有两个释放方法:每当从类中删除设备时调用 release,而类本身被释放时调用 class_release。

Each class needs a unique name, which is how this class appears under /sys/class. When the class is registered, all of the attributes listed in the (NULL-terminated) array pointed to by class_attrs are created. There is also a set of default attributes for every device added to the class; class_dev_attrs points to those. There is the usual hotplug function for adding variables to the environment when events are generated. There are also two release methods: release is called whenever a device is removed from the class, while class_release is called when the class itself is released.

注册功能有:

The registration functions are:

int class_register(struct class *cls);
void class_unregister(struct class *cls);

此时,使用属性的接口应该不会让任何人感到意外:

The interface for working with attributes should not surprise anybody at this point:

struct class_attribute {
    struct attribute attr;
    ssize_t (*show)(struct class *cls, char *buf);
    ssize_t (*store)(struct class *cls, const char *buf, size_t count);
};

CLASS_ATTR(name, mode, show, store);

int class_create_file(struct class *cls, 
                      const struct class_attribute *attr);
void class_remove_file(struct class *cls, 
                       const struct class_attribute *attr);

类设备

Class devices

类的真正目的是充当属于该类的设备的容器。类的成员由 struct class_device 表示:

The real purpose of a class is to serve as a container for the devices that are members of that class. A member is represented by struct class_device:

struct class_device {
    struct kobject kobj;
    struct class *class;
    struct device *dev;
    void *class_data;
    char class_id[BUS_ID_SIZE];
};

class_id 字段保存该设备在 sysfs 中出现的名称。class 指针应指向持有该设备的类,dev 应指向关联的 device 结构。设置 dev 是可选的;如果它非 NULL,则用于创建从类条目指向 /sys/devices 下相应条目的符号链接,以便在用户空间中轻松找到设备条目。类可以使用 class_data 保存一个私有指针。

The class_id field holds the name of this device as it appears in sysfs. The class pointer should point to the class holding this device, and dev should point to the associated device structure. Setting dev is optional; if it is non-NULL, it is used to create a symbolic link from the class entry to the corresponding entry under /sys/devices, making it easy to find the device entry in user space. The class can use class_data to hold a private pointer.

已经提供了常用的注册功能:

The usual registration functions have been provided:

int class_device_register(struct class_device *cd);
void class_device_unregister(struct class_device *cd);

类设备接口还允许重命名已注册的条目:

The class device interface also allows the renaming of an already registered entry:

int class_device_rename(struct class_device *cd, char *new_name);

类设备条目具有以下属性:

Class device entries have attributes:

struct class_device_attribute {
   struct attribute attr;
   ssize_t (*show)(struct class_device *cls, char *buf);
   ssize_t (*store)(struct class_device *cls, const char *buf, 
                    size_t count);
};

CLASS_DEVICE_ATTR(name, mode, show, store);

int class_device_create_file(struct class_device *cls, 
                             const struct class_device_attribute *attr);
void class_device_remove_file(struct class_device *cls, 
                              const struct class_device_attribute *attr);

注册类设备时,会根据类的 class_dev_attrs 字段创建一组默认属性;class_device_create_file 可用于创建附加属性。属性也可以添加到使用 class_simple 接口创建的类设备中。

A default set of attributes, in the class's class_dev_attrs field, is created when the class device is registered; class_device_create_file may be used to create additional attributes. Attributes may also be added to class devices created with the class_simple interface.
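To make this concrete, here is a sketch of how a driver might export a read-only attribute on a class device it has already registered. The attribute name (resolution), its value, and the my_cd variable are invented for illustration; the macros and functions are the class_device API described above, so this is an illustrative fragment rather than a complete module.

```
/* Sketch only: export a read-only "resolution" attribute on a class
 * device that has already been registered with
 * class_device_register(&my_cd). */
static ssize_t resolution_show(struct class_device *cd, char *buf)
{
        /* A real driver would report a value read from its hardware. */
        return snprintf(buf, PAGE_SIZE, "%i\n", 48);
}

/* Expands to a struct class_device_attribute named
 * class_device_attr_resolution. */
static CLASS_DEVICE_ATTR(resolution, S_IRUGO, resolution_show, NULL);

static int add_attrs(void)
{
        /* Returns a negative error code if the sysfs file could not
         * be created. */
        return class_device_create_file(&my_cd,
                                        &class_device_attr_resolution);
}
```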

类接口

Class interfaces

类子系统有一个 Linux 设备模型的其他部分中未发现的附加概念。这种机制称为 接口,但也许最好将其视为一种触发机制,可用于在设备进入或离开类时获取通知。

The class subsystem has an additional concept not found in other parts of the Linux device model. This mechanism is called an interface, but it is, perhaps, best thought of as a sort of trigger mechanism that can be used to get notification when devices enter or leave the class.

接口表示为:

An interface is represented by:

struct class_interface {
    struct class *class;
    int (*add) (struct class_device *cd);
    void (*remove) (struct class_device *cd);
};

接口可以通过以下方式注册和取消注册:

Interfaces can be registered and unregistered with:

int class_interface_register(struct class_interface *intf);
void class_interface_unregister(struct class_interface *intf);

接口的功能很简单。每当有类设备被添加到 class_interface 结构中指定的类时,就会调用该接口的 add 函数。该函数可以执行该设备所需的任何附加设置;这种设置通常采用添加更多属性的形式,但其他应用也是可能的。当设备从类中删除时,将调用 remove 方法来执行任何所需的清理。

The functioning of an interface is straightforward. Whenever a class device is added to the class specified in the class_interface structure, the interface's add function is called. That function can perform any additional setup required for that device; this setup often takes the form of adding more attributes, but other applications are possible. When the device is removed from the class, the remove method is called to perform any required cleanup.

一个类可以注册多个接口。

Multiple interfaces can be registered for a class.
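The trigger mechanism can be sketched as follows. The example_class variable and the printk messages are hypothetical; the structure layout and the registration calls are the ones shown above.

```
/* Hypothetical interface watching devices enter and leave a class. */
static int watcher_add(struct class_device *cd)
{
        printk(KERN_INFO "device %s joined the class\n", cd->class_id);
        /* Typical use: add extra attributes to the new device here. */
        return 0;
}

static void watcher_remove(struct class_device *cd)
{
        printk(KERN_INFO "device %s left the class\n", cd->class_id);
}

static struct class_interface watcher_intf = {
        .class  = &example_class,   /* assumed to be defined elsewhere */
        .add    = watcher_add,
        .remove = watcher_remove,
};

/* In module init and exit, respectively: */
/*   class_interface_register(&watcher_intf);   */
/*   class_interface_unregister(&watcher_intf); */
```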

把它们放在一起

Putting It All Together

为了更好地理解 驱动程序模型的作用是什么,让我们逐步了解内核中设备生命周期的步骤。我们描述 PCI 子系统如何与驱动程序模型交互、如何添加和删除驱动程序的基本概念以及如何在系统中添加和删除设备。这些细节在具体描述 PCI 内核代码的同时,也适用于使用驱动程序核心来管理其驱动程序和设备的所有其他子系统。

To better understand what the driver model does, let us walk through the steps of a device's lifecycle within the kernel. We describe how the PCI subsystem interacts with the driver model, the basic concepts of how a driver is added and removed, and how a device is added and removed from the system. These details, while describing the PCI kernel code specifically, apply to all other subsystems that use the driver core to manage their drivers and devices.

PCI 核心、驱动程序核心和各个 PCI 驱动程序之间的交互相当复杂,如图 14-3 所示。

The interaction between the PCI core, driver core, and the individual PCI drivers is quite complex, as Figure 14-3 shows.

图 14-3。设备创建过程

Figure 14-3. Device-creation process

添加设备

Add a Device

PCI 子系统声明了一个名为 pci_bus_type 的 struct bus_type,它使用以下值进行初始化:

The PCI subsystem declares a single struct bus_type called pci_bus_type, which is initialized with the following values:

struct bus_type pci_bus_type = {
    .name      = "pci",
    .match     = pci_bus_match,
    .hotplug   = pci_hotplug,
    .suspend   = pci_device_suspend,
    .resume    = pci_device_resume,
    .dev_attrs = pci_dev_attrs,
};

当 PCI 子系统加载到内核中时,会通过调用 bus_register 将这个 pci_bus_type 变量注册到驱动程序核心。此时,驱动程序核心会在 sysfs 中创建 /sys/bus/pci 目录,其中包含两个子目录:devices 和 drivers。

This pci_bus_type variable is registered with the driver core when the PCI subsystem is loaded in the kernel with a call to bus_register. When that happens, the driver core creates a sysfs directory in /sys/bus/pci that consists of two directories: devices and drivers.

所有 PCI 驱动程序都必须定义一个struct pci_driver 变量,该变量定义该 PCI 驱动程序可以执行的不同功能(有关 PCI 子系统以及如何编写 PCI 驱动程序的更多信息,请参阅第 12 章)。该结构包含一个struct device_driver,然后在注册 PCI 驱动程序时由 PCI 核心初始化:

All PCI drivers must define a struct pci_driver variable that defines the different functions that this PCI driver can do (for more information about the PCI subsystem and how to write a PCI driver, see Chapter 12). That structure contains a struct device_driver that is then initialized by the PCI core when the PCI driver is registered:

/* initialize common driver fields */
drv->driver.name = drv->name;
drv->driver.bus = &pci_bus_type;
drv->driver.probe = pci_device_probe;
drv->driver.remove = pci_device_remove;
drv->driver.kobj.ktype = &pci_driver_kobj_type;

此代码将驱动程序的 bus 设置为指向 pci_bus_type,并将 probe 和 remove 函数指向 PCI 核心内的函数。驱动程序 kobject 的 ktype 被设置为变量 pci_driver_kobj_type,以便 PCI 驱动程序的属性文件正常工作。然后 PCI 核心向驱动程序核心注册该 PCI 驱动:

This code sets up the bus for the driver to point to the pci_bus_type and points the probe and remove functions to point to functions within the PCI core. The ktype for the driver's kobject is set to the variable pci_driver_kobj_type, in order for the PCI driver's attribute files to work properly. Then the PCI core registers the PCI driver with the driver core:

/* register with core */
error = driver_register(&drv->driver);

该驱动程序现在已准备好绑定到它支持的任何 PCI 设备。

The driver is now ready to be bound to any PCI devices it supports.
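A minimal, hypothetical driver following this pattern might look like the sketch below. The vendor/device IDs and all names are invented, and a real probe function would do far more than enable the device; only the pci_driver fields and the registration calls reflect the API discussed here.

```
/* Sketch of a minimal PCI driver (illustrative names and IDs). */
static struct pci_device_id example_ids[] = {
        { PCI_DEVICE(0x1234, 0x5678) },   /* hypothetical vendor/device */
        { 0, }
};
MODULE_DEVICE_TABLE(pci, example_ids);

static int example_probe(struct pci_dev *pdev,
                         const struct pci_device_id *id)
{
        /* Claim the device; return a negative errno to decline it. */
        return pci_enable_device(pdev);
}

static void example_remove(struct pci_dev *pdev)
{
        pci_disable_device(pdev);
}

static struct pci_driver example_driver = {
        .name     = "example",
        .id_table = example_ids,
        .probe    = example_probe,
        .remove   = example_remove,
};

/* module init: pci_register_driver(&example_driver);   */
/* module exit: pci_unregister_driver(&example_driver); */
```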

PCI 内核在实际与 PCI 总线通信的特定于体系结构的代码的帮助下,开始探测 PCI 地址空间,寻找所有 PCI 设备。当找到 PCI 设备时,PCI 核心会在内存中创建一个类型为 的新变量struct pci_dev。结构的一部分struct pci_dev如下所示:

The PCI core, with help from the architecture-specific code that actually talks to the PCI bus, starts probing the PCI address space, looking for all PCI devices. When a PCI device is found, the PCI core creates a new variable in memory of type struct pci_dev. A portion of the struct pci_dev structure looks like the following:

struct pci_dev {
    /* ... */
    unsigned int   devfn;
    unsigned short vendor;
    unsigned short device;
    unsigned short subsystem_vendor;
    unsigned short subsystem_device;
    unsigned int   class;
    /* ... */
    struct pci_driver *driver;
    /* ... */
    struct device dev;
    /* ... */
};

该 PCI 设备的总线特定字段(devfn、vendor、device 等)由 PCI 核心初始化,并且 struct device 的 parent 变量被设置为该 PCI 设备所在的 PCI 总线设备。bus 变量被设置为指向 pci_bus_type 结构。然后根据从 PCI 设备读取的名称和 ID 设置 name 和 bus_id 变量。

The bus-specific fields of this PCI device are initialized by the PCI core (the devfn, vendor, device, and other fields), and the struct device variable's parent variable is set to the PCI bus device that this PCI device lives on. The bus variable is set to point at the pci_bus_type structure. Then the name and bus_id variables are set, depending on the name and ID that is read from the PCI device.

PCI 设备结构初始化后,设备通过调用以下命令向驱动程序核心注册:

After the PCI device structure is initialized, the device is registered with the driver core with a call to:

device_register(&dev->dev);

device_register函数中,驱动程序核心初始化了许多设备的字段,将设备的 kobject 注册到 kobject 核心(这会导致生成热插拔事件,但我们将在本章后面讨论),然后将设备添加到设备的父设备持有的设备列表。这样做是为了让所有设备都可以按正确的顺序遍历,并且始终知道每个设备位于设备层次结构中的位置。

Within the device_register function, the driver core initializes a number of the device's fields, registers the device's kobject with the kobject core (which causes a hotplug event to be generated, but we discuss that later in this chapter), and then adds the device to the list of devices that are held by the device's parent. This is done so that all devices can be walked in the proper order, always knowing where in the hierarchy of devices each one lives.

然后,该设备被添加到所有设备的总线特定列表中,在本例中为列表 pci_bus_type。然后遍历在总线上注册的所有驱动程序的列表,并为每个驱动程序调用总线的匹配函数,指定该设备。对于pci_bus_type总线,在设备提交给驱动核心之前,匹配函数被PCI核心设置为指向pci_bus_match函数。

The device is then added to the bus-specific list of all devices, in this example, the pci_bus_type list. Then the list of all drivers that are registered with the bus is walked, and the match function of the bus is called for every driver, specifying this device. For the pci_bus_type bus, the match function was set to point to the pci_bus_match function by the PCI core before the device was submitted to the driver core.

pci_bus_match 函数将驱动程序核心传递给它的 struct device 转换回 struct pci_dev,并将 struct device_driver 转换回 struct pci_driver,然后查看设备和驱动程序的 PCI 特定信息,看看驱动程序是否声明它可以支持这种设备。如果匹配不成功,该函数将返回 0 给驱动程序核心,驱动程序核心随后移至其列表中的下一个驱动程序。

The pci_bus_match function casts the struct device that was passed to it by the driver core, back into a struct pci_dev. It also casts the struct device_driver back into a struct pci_driver and then looks at the PCI device-specific information of the device and driver to see if the driver states that it can support this kind of device. If the match is not successful, the function returns 0 back to the driver core, and the driver core moves on to the next driver in its list.

如果匹配成功,函数返回 1 给驱动程序核心。这会导致驱动程序核心将 struct device 中的 driver 指针设置为指向该驱动程序,然后调用 struct device_driver 中指定的 probe 函数。

If the match is successful, the function returns 1 back to the driver core. This causes the driver core to set the driver pointer in the struct device to point to this driver, and then it calls the probe function that is specified in the struct device_driver.

早些时候,在 PCI 驱动程序注册到驱动程序核心之前,probe 变量被设置为指向 pci_device_probe 函数。此函数(再次)将 struct device 转换回 struct pci_dev,并将设备中设置的 struct driver 转换回 struct pci_driver。它再次验证该驱动程序是否声明它可以支持该设备(出于某种未知原因,这似乎是一次冗余的额外检查),增加设备的引用计数,然后使用指向它应该绑定到的 struct pci_dev 结构的指针调用 PCI 驱动程序的 probe 函数。

Earlier, before the PCI driver was registered with the driver core, the probe variable was set to point at the pci_device_probe function. This function casts (yet again) the struct device back into a struct pci_dev and the struct driver that is set in the device back into a struct pci_driver. It again verifies that this driver states that it can support this device (which seems to be a redundant extra check for some unknown reason), increments the reference count of the device, and then calls the PCI driver's probe function with a pointer to the struct pci_dev structure it should bind to.

如果 PCI 驱动程序的 probe 函数确定由于某种原因它无法处理该设备,它会返回一个负的错误值,该值会传播回驱动程序核心,并导致其继续在驱动程序列表中为该设备寻找匹配。如果 probe 函数可以声明该设备,它将执行正确处理该设备所需的所有初始化,然后返回 0 给驱动程序核心。这会导致驱动程序核心将该设备添加到当前绑定到该特定驱动程序的所有设备的列表中,并在 sysfs 中该驱动程序的目录下创建一个指向它现在控制的设备的符号链接。此符号链接允许用户准确查看哪些设备绑定到了哪些驱动程序。它看起来如下:

If the PCI driver's probe function determines that it cannot handle this device for some reason, it returns a negative error value, which is propagated back to the driver core and causes it to continue looking through the list of drivers to match one up with this device. If the probe function can claim the device, it does all the initialization that it needs to do to handle the device properly, and then it returns 0 back up to the driver core. This causes the driver core to add the device to the list of all devices currently bound by this specific driver and creates a symlink within the driver's directory in sysfs to the device that it is now controlling. This symlink allows users to see exactly which devices are bound to which drivers. This can be seen as:

$ tree /sys/bus/pci
/sys/bus/pci/
|-- devices
|   |-- 0000:00:00.0 -> ../../../devices/pci0000:00/0000:00:00.0
|   |-- 0000:00:00.1 -> ../../../devices/pci0000:00/0000:00:00.1
|   |-- 0000:00:00.2 -> ../../../devices/pci0000:00/0000:00:00.2
|   |-- 0000:00:02.0 -> ../../../devices/pci0000:00/0000:00:02.0
|   |-- 0000:00:04.0 -> ../../../devices/pci0000:00/0000:00:04.0
|   |-- 0000:00:06.0 -> ../../../devices/pci0000:00/0000:00:06.0
|   |-- 0000:00:07.0 -> ../../../devices/pci0000:00/0000:00:07.0
|   |-- 0000:00:09.0 -> ../../../devices/pci0000:00/0000:00:09.0
|   |-- 0000:00:09.1 -> ../../../devices/pci0000:00/0000:00:09.1
|   |-- 0000:00:09.2 -> ../../../devices/pci0000:00/0000:00:09.2
|   |-- 0000:00:0c.0 -> ../../../devices/pci0000:00/0000:00:0c.0
|   |-- 0000:00:0f.0 -> ../../../devices/pci0000:00/0000:00:0f.0
|   |-- 0000:00:10.0 -> ../../../devices/pci0000:00/0000:00:10.0
|   |-- 0000:00:12.0 -> ../../../devices/pci0000:00/0000:00:12.0
|   |-- 0000:00:13.0 -> ../../../devices/pci0000:00/0000:00:13.0
|   `-- 0000:00:14.0 -> ../../../devices/pci0000:00/0000:00:14.0
`-- drivers
    |-- ALI15x3_IDE
    |   `-- 0000:00:0f.0 -> ../../../../devices/pci0000:00/0000:00:0f.0
    |-- ehci_hcd
    |   `-- 0000:00:09.2 -> ../../../../devices/pci0000:00/0000:00:09.2
    |-- ohci_hcd
    |   |-- 0000:00:02.0 -> ../../../../devices/pci0000:00/0000:00:02.0
    |   |-- 0000:00:09.0 -> ../../../../devices/pci0000:00/0000:00:09.0
    |   `-- 0000:00:09.1 -> ../../../../devices/pci0000:00/0000:00:09.1
    |-- orinoco_pci
    |   `-- 0000:00:12.0 -> ../../../../devices/pci0000:00/0000:00:12.0
    |-- radeonfb
    |   `-- 0000:00:14.0 -> ../../../../devices/pci0000:00/0000:00:14.0
    |-- serial
    `-- trident
        `-- 0000:00:04.0

删除设备

Remove a Device

PCI 设备可以通过多种不同的方式从系统中删除。所有 CardBus 设备实际上都是具有不同物理外形的 PCI 设备,内核的 PCI 核心并不区分它们。允许在机器仍在运行时移除或添加 PCI 设备的系统变得越来越流行,Linux 也支持它们。还有一个假的 PCI Hotplug 驱动程序,允许开发人员测试其 PCI 驱动程序是否能在系统运行时正确处理设备的删除。该模块称为 fakephp,它使内核认为 PCI 设备已消失,但它并不允许用户从没有相应硬件的系统中物理移除 PCI 设备。有关如何使用该驱动程序测试您的 PCI 驱动程序的更多信息,请参阅该驱动程序附带的文档。

A PCI device can be removed from a system in a number of different ways. All CardBus devices are really PCI devices in a different physical form factor, and the kernel PCI core does not differentiate between them. Systems that allow the removal or addition of PCI devices while the machine is still running are becoming more popular, and Linux supports them. There is also a fake PCI Hotplug driver that allows developers to test whether their PCI driver properly handles the removal of a device while the system is running. This module is called fakephp and causes the kernel to think the PCI device is gone, but it does not allow users to physically remove a PCI device from a system that does not have the proper hardware to do so. See the documentation with this driver for more information on how to use it to test your PCI drivers.

PCI 核心删除设备所花费的精力比添加设备少得多。当要删除 PCI 设备时,会调用 pci_remove_bus_device 函数。该函数执行一些特定于 PCI 的清理和内务处理,然后使用指向 struct pci_dev 中 struct device 成员的指针调用 device_unregister 函数。

The PCI core exerts a lot less effort to remove a device than it does to add it. When a PCI device is to be removed, the pci_remove_bus_device function is called. This function does some PCI-specific cleanups and housekeeping, and then calls the device_unregister function with a pointer to the struct pci_dev's struct device member.

device_unregister函数中,驱动程序核心只是取消 sysfs 文件与绑定到设备(如果有的话)的驱动程序的链接,从其内部设备列表中删除该设备,并使用指向包含struct kobjectstruct device结构。该函数对用户空间进行热插拔调用,表明该 kobject 现在已从系统中删除,然后删除与该 kobject 关联的所有 sysfs 文件以及该 kobject 最初创建的 sysfs 目录本身。

In the device_unregister function, the driver core merely unlinks the sysfs files from the driver bound to the device (if there was one), removes the device from its internal list of devices, and calls kobject_del with a pointer to the struct kobject that is contained in the struct device structure. That function makes a hotplug call to user space stating that the kobject is now removed from the system, and then it deletes all sysfs files associated with the kobject and the sysfs directory itself that the kobject had originally created.

kobject_del 函数还删除设备本身的 kobject 引用。如果该引用是最后一个(意味着没有为设备的 sysfs 条目打开用户空间文件),则调用 PCI 设备本身的释放函数 pci_release_dev。该函数只是释放 struct pci_dev 占用的内存。

The kobject_del function also removes the kobject reference of the device itself. If that reference was the last one (meaning no user-space files were open for the sysfs entry of the device), then the release function for the PCI device itself, pci_release_dev, is called. That function merely frees up the memory that the struct pci_dev took up.

此后,与该设备关联的所有 sysfs 条目都将被删除,并且与该设备关联的内存将被释放。PCI 设备现已从系统中完全删除。

After this, all sysfs entries associated with the device are removed, and the memory associated with the device is released. The PCI device is now totally removed from the system.

添加驱动程序

Add a Driver

当 PCI 驱动程序调用 pci_register_driver 函数时,它就被添加到了 PCI 核心。此函数仅初始化包含在 struct pci_driver 结构中的 struct device_driver 结构,如前面有关添加设备的部分所述。然后 PCI 核心用指向 struct pci_driver 结构中包含的 struct device_driver 结构的指针来调用驱动程序核心中的 driver_register 函数。

A PCI driver is added to the PCI core when it calls the pci_register_driver function. This function merely initializes the struct device_driver structure that is contained within the struct pci_driver structure, as previously mentioned in the section about adding a device. Then the PCI core calls the driver_register function in the driver core with a pointer to the struct device_driver structure contained in the struct pci_driver structure.

driver_register 函数初始化 struct device_driver 结构中的几个锁,然后调用 bus_add_driver 函数。该函数执行以下步骤:

The driver_register function initializes a few locks in the struct device_driver structure, and then calls the bus_add_driver function. This function does the following steps:

  • 查找与驱动程序关联的总线。如果没有找到该总线,该函数立即返回。

  • Looks up the bus that the driver is to be associated with. If this bus is not found, the function instantly returns.

  • 驱动程序的 sysfs 目录是根据驱动程序的名称及其关联的总线创建的。

  • The driver's sysfs directory is created based on the name of the driver and the bus that it is associated with.

  • 获取总线的内部锁,然后遍历所有已在总线上注册的设备,并为它们调用匹配函数,就像添加新设备时一样。如果该匹配函数成功,则会发生其余的绑定过程,如上一节所述。

  • The bus's internal lock is grabbed, and then all devices that have been registered with the bus are walked, and the match function is called for them, just like when a new device is added. If that match function succeeds, then the rest of the binding process occurs, as described in the previous section.

删除驱动程序

Remove a Driver

删除驱动程序是一个非常简单的动作。对于 PCI 驱动程序,驱动程序调用 pci_unregister_driver 函数。该函数仅调用驱动程序核心函数 driver_unregister,并传递一个指向 struct pci_driver 结构中的 struct device_driver 部分的指针。

Removing a driver is a very simple action. For a PCI driver, the driver calls the pci_unregister_driver function. This function merely calls the driver core function driver_unregister, with a pointer to the struct device_driver portion of the struct pci_driver structure passed to it.

driver_unregister函数通过清理附加到 sysfs 树中驱动程序条目的一些 sysfs 属性来处理一些基本的内务处理。然后它迭代连接到该驱动程序的所有设备并 为其调用释放函数。这与前面提到的当设备从系统中删除时的释放函数完全相同

The driver_unregister function handles some basic housekeeping by cleaning up some sysfs attributes that were attached to the driver's entry in the sysfs tree. It then iterates over all devices that were attached to this driver and calls the release function for each one. This happens exactly like the previously mentioned release function for when a device is removed from the system.

所有设备与驱动程序解除绑定后,驱动程序代码执行以下独特的逻辑:

After all devices are unbound from the driver, the driver code does this unique bit of logic:

down(&drv->unload_sem);
up(&drv->unload_sem);

这是在返回函数调用者之前完成的。之所以会获取此锁,是因为代码需要等待该驱动程序上的所有引用计数降到 0 之后才能安全返回。之所以需要这样做,是因为 driver_unregister 函数最常作为正在卸载的模块的退出路径被调用。只要还有设备引用该驱动程序,该模块就需要保留在内存中;通过等待该锁被释放,内核就能知道何时可以安全地将驱动程序从内存中删除。

This is done right before returning to the caller of the function. This lock is grabbed because the code needs to wait for all reference counts on this driver to drop to 0 before it is safe to return. This is needed because the driver_unregister function is most commonly called on the exit path of a module that is being unloaded. The module needs to remain in memory for as long as the driver is referenced by devices; by waiting for this lock to be freed, the kernel knows when it is safe to remove the driver from memory.

热插拔

Hotplug

有两种不同的方式来看待热插拔。内核将热插拔视为硬件、内核和内核驱动程序之间的交互。用户将热插拔视为内核和用户空间之间通过名为 /sbin/hotplug 的程序进行的交互。当内核想要通知用户空间内核中刚刚发生了某种类型的热插拔事件时,就会调用此程序。

There are two different ways to view hotplugging. The kernel views hotplugging as an interaction between the hardware, the kernel, and the kernel driver. Users view hotplugging as the interaction between the kernel and user space through the program called /sbin/hotplug. This program is called by the kernel when it wants to notify user space that some type of hotplug event has just happened within the kernel.

动态设备

Dynamic Devices

术语"热插拔"最常用的含义,是指现在几乎所有计算机系统都可以处理在系统开机状态下出现或消失的设备。这与几年前的计算机系统有很大不同,当时的程序员知道他们只需要在启动时扫描所有设备,并且在整台机器断电之前,他们永远不必担心设备会消失。现在,随着 USB、CardBus、PCMCIA、IEEE1394 和 PCI Hotplug 控制器的出现,无论在系统中添加或删除什么硬件,Linux 内核都需要能够可靠地运行。这给设备驱动程序作者带来了额外的负担,因为他们现在必须随时处理设备在毫无通知的情况下突然被拔除的情况。

The most commonly used meaning of the term "hotplug" arises when discussing the fact that almost all computer systems can now handle devices appearing or disappearing while the system is powered on. This is very different from the computer systems of only a few years ago, where programmers knew that they needed to scan for all devices only at boot time, and they never had to worry about their devices disappearing until the power was turned off to the whole machine. Now, with the advent of USB, CardBus, PCMCIA, IEEE1394, and PCI Hotplug controllers, the Linux kernel needs to be able to run reliably no matter what hardware is added to or removed from the system. This places an added burden on device driver authors, as they must now always handle a device being suddenly ripped out from underneath them without any notice.

每种不同的总线类型以不同的方式处理设备丢失。例如,当 PCI、CardBus 或 PCMCIA 设备从系统中删除时,通常要过一段时间,驱动程序才会通过其 remove 函数收到此操作的通知。在此之前,从 PCI 总线的所有读取都会返回全 1。这意味着驱动程序需要始终检查从 PCI 总线读取的数据值,并能够正确处理 0xff 值。

Each different bus type handles the loss of a device in a different way. For example, when a PCI, CardBus, or PCMCIA device is removed from the system, it is usually a while before the driver is notified of this action through its remove function. Before that happens, all reads from the PCI bus return all bits set. This means that drivers need to always check the value of the data they read from the PCI bus and properly be able to handle a 0xff value.

drivers/usb/host/ehci-hcd.c驱动程序中可以看到这样的一个示例 ,它是 USB 2.0(高速)控制器卡的 PCI 驱动程序。它的主握手循环中有以下代码来检测控制器卡是否已从系统中移除:

An example of this can be seen in the drivers/usb/host/ehci-hcd.c driver, which is a PCI driver for a USB 2.0 (high-speed) controller card. It has the following code in its main handshake loop to detect if the controller card has been removed from the system:

result = readl(ptr);
if (result == ~(u32)0)    /* card removed */
    return -ENODEV;

对于 USB 驱动程序,当 USB 驱动程序绑定到的设备从系统中删除时,提交到该设备的任何挂起的 urb 都会开始失败,并出现错误 -ENODEV。驱动程序需要识别此错误并正确清除任何挂起的 I/O(如果发生)。

For USB drivers, when the device that a USB driver is bound to is removed from the system, any pending urbs that were submitted to the device start failing with the error -ENODEV. The driver needs to recognize this error and properly clean up any pending I/O if it occurs.
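A completion handler that tolerates removal might be structured like this sketch. The function name is invented; the status values are standard URB error codes of the 2.6-era USB API (whose completion handlers took a struct pt_regs * argument), and real cleanup logic is elided.

```
/* Sketch of a URB completion handler that tolerates device removal. */
static void example_urb_complete(struct urb *urb, struct pt_regs *regs)
{
        switch (urb->status) {
        case 0:
                break;            /* success; process urb->transfer_buffer */
        case -ENODEV:             /* device was unplugged */
        case -ECONNRESET:         /* urb was unlinked */
        case -ESHUTDOWN:          /* host controller is going away */
                return;           /* do not resubmit */
        default:
                break;            /* transient error; try again */
        }
        /* Resubmit for the next transfer; in atomic context here. */
        usb_submit_urb(urb, GFP_ATOMIC);
}
```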

可热插拔设备不仅限于鼠标、键盘和网卡等传统设备。现在有许多系统支持移除和添加整个 CPU 和内存条。幸运的是,Linux 内核正确地处理了此类核心“系统”设备的添加和删除,因此各个设备驱动程序不需要关注这些事情。

Hotpluggable devices are not limited only to traditional devices such as mice, keyboards, and network cards. There are numerous systems that now support removal and addition of entire CPUs and memory sticks. Fortunately the Linux kernel properly handles the addition and removal of such core "system" devices so that individual device drivers do not need to pay attention to these things.

/sbin/hotplug 实用程序

The /sbin/hotplug Utility

正如本章前面提到的,每当在系统中添加或删除设备时,都会生成"热插拔事件"。这意味着内核会调用用户空间程序 /sbin/hotplug。该程序通常是一个非常小的 bash 脚本,仅将执行传递给放置在 /etc/hotplug.d/ 目录树中的其他程序。对于大多数 Linux 发行版,此脚本如下所示:

As alluded to earlier in this chapter, whenever a device is added or removed from the system, a "hotplug event" is generated. This means that the kernel calls the user-space program /sbin/hotplug. This program is typically a very small bash script that merely passes execution on to a list of other programs that are placed in the /etc/hotplug.d/ directory tree. For most Linux distributions, this script looks like the following:

DIR="/etc/hotplug.d"
for I in "${DIR}/$1/"*.hotplug "${DIR}/"default/*.hotplug ; do
    if [ -f $I ]; then
        test -x $I && $I $1 ;
    fi
done
exit 1

换句话说,该脚本搜索所有可能对此事件感兴趣的带有.hotplug后缀的程序并调用它们,向它们传递内核设置的许多不同的环境变量。有关/sbin/hotplug脚本如何工作的更多详细信息可以在程序的注释和 hotplug(8)联机帮助页中找到。

In other words, the script searches for all programs bearing a .hotplug suffix that might be interested in this event and invokes them, passing to them a number of different environment variables that have been set by the kernel. More details about how the /sbin/hotplug script works can be found in the comments in the program and in the hotplug(8) manpage.

如前所述,每当创建或销毁 kobject 时都会调用/sbin/hotplug 。使用提供事件名称的单个命令行参数来调用热插拔程序。核心内核和涉及的特定子系统还设置了一系列环境变量(如下所述),其中包含有关刚刚发生的情况的信息。热插拔程序使用这些变量来确定内核中刚刚发生的情况,以及是否应该执行任何特定操作。

As mentioned previously, /sbin/hotplug is called whenever a kobject is created or destroyed. The hotplug program is called with a single command-line argument providing a name for the event. The core kernel and specific subsystem involved also set a series of environment variables (described below) with information on what has just occurred. These variables are used by the hotplug programs to determine what has just happened in the kernel, and if there is any specific action that should take place.

传递给 /sbin/hotplug 的命令行参数是与此热插拔事件关联的名称,由分配给 kobject 的 kset 确定。这个名称可以通过调用 kset 的 hotplug_ops 结构(本章前面已描述)中的 name 函数来设置;如果该函数不存在或从未被调用,则该名称就是 kset 本身的名称。

The command-line argument passed to /sbin/hotplug is the name associated with this hotplug event, as determined by the kset assigned to the kobject. This name can be set by a call to the name function that is part of the kset's hotplug_ops structure described earlier in this chapter; if that function is not present or never called, the name is that of the kset itself.

始终为/sbin/hotplug程序设置的默认环境变量 是:

The default environment variables that are always set for the /sbin/hotplug program are:

ACTION
ACTION

字符串addremove,取决于相关对象是刚刚创建还是销毁。

The string add or remove, depending on whether the object in question was just created or destroyed.

DEVPATH
DEVPATH

sysfs 文件系统中的目录路径,指向正在创建或销毁的 kobject。请注意,sysfs 文件系统的挂载点不会添加到此路径中,因此由用户空间程序来确定。

A directory path, within the sysfs filesystem, that points to the kobject that is being either created or destroyed. Note that the mount point of the sysfs filesystem is not added to this path, so it is up to the user-space program to determine that.

SEQNUM
SEQNUM

此热插拔事件的序列号。序列号是一个 64 位数字,每次生成热插拔事件时都会递增。这允许用户空间按照内核生成热插拔事件的顺序对热插拔事件进行排序,因为用户空间程序可能会乱序运行。

The sequence number for this hotplug event. The sequence number is a 64-bit number that is incremented for every hotplug event that is generated. This allows user space to sort the hotplug events in the order in which the kernel generates them, as it is possible for a user-space program to be run out of order.

SUBSYSTEM
SUBSYSTEM

如上所述,相同的字符串作为命令行参数传递。

The same string passed as the command-line argument as described above.

当与总线关联的设备在系统中添加或删除时,许多不同的总线子系统都会将自己的环境变量添加到 /sbin/hotplug 调用中。它们在分配给总线的 struct kset_hotplug_ops 中指定的 hotplug 回调中执行此操作(如第 14.3.1 节所述)。这使得用户空间能够自动加载控制总线所发现设备可能需要的任何必要模块。以下列出了不同的总线类型以及它们添加到 /sbin/hotplug 调用的环境变量。

A number of the different bus subsystems all add their own environment variables to the /sbin/hotplug call, when devices associated with the bus are added or removed from the system. They do this in their hotplug callback that is specified in the struct kset_hotplug_ops assigned to their bus (as described in Section 14.3.1). This allows user space to be able to automatically load any necessary module that might be needed to control the device that has been found by the bus. Here is a list of the different bus types and what environment variables they add to the /sbin/hotplug call.

IEEE1394(火线)

IEEE1394 (FireWire)

IEEE1394 总线(也称为 FireWire)上的任何设备,其 /sbin/hotplug 参数名称和 SUBSYSTEM 环境变量都设置为值 ieee1394。ieee1394 子系统还始终添加以下环境变量:

Any devices on the IEEE1394 bus, also known as FireWire, have the /sbin/hotplug parameter name and the SUBSYSTEM environment variable set to the value ieee1394. The ieee1394 subsystem also always adds the following environment variables:

VENDOR_ID
VENDOR_ID

IEEE1394 设备的 24 位供应商 ID

The 24-bit vendor ID for the IEEE1394 device

MODEL_ID
MODEL_ID

IEEE1394 设备的 24 位型号 ID

The 24-bit model ID for the IEEE1394 device

GUID
GUID

设备的 64 位 GUID

The 64-bit GUID for the device

SPECIFIER_ID
SPECIFIER_ID

指定该设备的协议规范所有者的 24 位值

The 24-bit value specifying the owner of the protocol spec for this device

VERSION
VERSION

指定该设备的协议规范版本的值

The value that specifies the version of the protocol spec for this device

联网

Networking

当设备在内核中注册或取消注册时,所有网络设备都会创建热插拔事件。/sbin/hotplug 调用将参数名称和 SUBSYSTEM 环境变量设置为值 net,并且仅添加以下环境变量:

All network devices create a hotplug event when the device is registered or unregistered in the kernel. The /sbin/hotplug call has the parameter name and the SUBSYSTEM environment variable set to the value net, and just adds the following environment variable:

INTERFACE
INTERFACE

已从内核注册或取消注册的接口的名称。这方面的例子有loeth0

The name of the interface that has been registered or unregistered from the kernel. Examples of this are lo and eth0.

PCI

PCI

PCI 总线上的任何设备,其参数名称和 SUBSYSTEM 环境变量都设置为值 pci。PCI 子系统还始终添加以下四个环境变量:

Any devices on the PCI bus have the parameter name and the SUBSYSTEM environment variable set to the value pci. The PCI subsystem also always adds the following four environment variables:

PCI_CLASS
PCI_CLASS

设备的 PCI 类号(十六进制)。

The PCI class number for the device, in hex.

PCI_ID
PCI_ID

设备的 PCI 供应商 ID 和设备 ID(十六进制),组合格式为 vendor:device。

The PCI vendor and device IDs for the device, in hex, combined in the format vendor:device.

PCI_SUBSYS_ID
PCI_SUBSYS_ID

PCI 子系统供应商 ID 和子系统设备 ID,组合格式为 subsys_vendor:subsys_device。

The PCI subsystem vendor and subsystem device IDs, combined in the format subsys_vendor:subsys_device.

PCI_SLOT_NAME
PCI_SLOT_NAME

由内核赋予设备的 PCI 插槽“名称”,格式为 domain:bus:slot.function,例如 0000:00:0d.0。

The PCI slot "name" that is given to the device by the kernel. It is in the format domain:bus:slot.function. An example might be 0000:00:0d.0.
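An /sbin/hotplug agent receives all of these PCI variables through its environment. The following user-space sketch shows that flow; the helper name and the message format are invented for illustration and are not part of any real hotplug package:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of what an /sbin/hotplug agent sees for SUBSYSTEM=pci events:
 * the kernel passes everything in the environment before invoking it. */
static int describe_pci_device(char *out, size_t len)
{
    const char *id = getenv("PCI_ID");          /* "vendor:device", hex */
    const char *slot = getenv("PCI_SLOT_NAME"); /* "domain:bus:slot.function" */
    const char *class = getenv("PCI_CLASS");    /* class number, hex */

    if (!id || !slot || !class)
        return -1;                              /* not a PCI hotplug event */
    snprintf(out, len, "pci device %s in slot %s (class %s)", id, slot, class);
    return 0;
}
```

A real agent would go on to select and load a driver module based on these values, as described in the Linux hotplug scripts section below.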

输入

Input

对于所有输入设备(鼠标、键盘、操纵杆等),在内核中添加或删除设备时都会生成热插拔事件。/sbin/hotplug 的参数和 SUBSYSTEM 环境变量被设置为 input。输入子系统还始终添加以下环境变量:

For all input devices (mice, keyboards, joysticks, etc.), a hotplug event is generated when the device is added and removed from the kernel. The /sbin/hotplug parameter and the SUBSYSTEM environment variable are set to the value input. The input subsystem also always adds the following environment variable:

PRODUCT
PRODUCT

一个多值字符串,以不带前导零的十六进制列出各个值,格式为 bustype:vendor:product:version。

A multivalue string listing values in hex with no leading zeros. It is in the format bustype:vendor:product:version.

如果设备支持,则可能存在以下环境变量:

The following environment variables may be present, if the device supports it:

NAME
NAME

由设备本身报告的输入设备名称。

The name of the input device as given by the device.

PHYS
PHYS

输入子系统赋予该设备的物理地址。它应当是稳定的,取决于设备所插入的总线位置。

The device's physical address that the input subsystem gave to this device. It is supposed to be stable, depending on the bus position into which the device was plugged.

EV

KEY

REL

ABS

MSC

LED

SND

FF
EV

KEY

REL

ABS

MSC

LED

SND

FF

这些都来自输入设备描述符,并且如果特定输入设备支持,则设置为适当的值。

These all come from the input device descriptor and are set to the appropriate values if the specific input device supports it.

USB

USB

USB 总线上的任何设备,其参数名称和 SUBSYSTEM 环境变量都被设置为 usb。USB 子系统还始终添加以下环境变量:

Any devices on the USB bus have the parameter name and the SUBSYSTEM environment variable set to the value usb. The USB subsystem also always adds the following environment variables:

PRODUCT
PRODUCT

格式为 idVendor/idProduct/bcdDevice 的字符串,给出这些 USB 设备特有的字段

A string in the format idVendor/idProduct/bcdDevice that specifies those USB device-specific fields

TYPE
TYPE

格式为 bDeviceClass/bDeviceSubClass/bDeviceProtocol 的字符串,给出这些 USB 设备特有的字段

A string in the format bDeviceClass/bDeviceSubClass/bDeviceProtocol that specifies those USB device-specific fields

如果 bDeviceClass 字段被设置为 0,则还会设置以下环境变量:

If the bDeviceClass field is set to 0, the following environment variable is also set:

INTERFACE
INTERFACE

格式为 bInterfaceClass/bInterfaceSubClass/bInterfaceProtocol 的字符串,给出这些 USB 设备特有的字段。

A string in the format bInterfaceClass/bInterfaceSubClass/bInterfaceProtocol that specifies those USB device-specific fields.

如果选中了内核构建选项 CONFIG_USB_DEVICEFS(它把 usbfs 文件系统编译进内核),则还会设置以下环境变量:

If the kernel build option, CONFIG_USB_DEVICEFS, which selects the usbfs filesystem to be built in the kernel, is selected, the following environment variable is also set:

DEVICE
DEVICE

一个字符串,表示设备在 usbfs 文件系统中的位置。其格式为 /proc/bus/usb/USB_BUS_NUMBER/USB_DEVICE_NUMBER,其中 USB_BUS_NUMBER 是设备所在 USB 总线的三位数编号,USB_DEVICE_NUMBER 是内核分配给该 USB 设备的三位数编号。

A string that shows where in the usbfs filesystem the device is located. This string is in the format /proc/bus/usb/USB_BUS_NUMBER/USB_DEVICE_NUMBER, in which USB_BUS_NUMBER is the three-digit number of the USB bus that the device is on, and USB_DEVICE_NUMBER is the three-digit number that has been assigned by the kernel to that USB device.
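The three-digit, zero-padded components of that path can be reconstructed from plain bus and device numbers; this small user-space sketch (the helper name is ours, not part of any kernel or usbfs API) shows the formatting:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the usbfs path described above for a given USB bus number and
 * device number, zero-padding each component to three digits. */
static void usbfs_path(char *buf, size_t len, int bus, int dev)
{
    snprintf(buf, len, "/proc/bus/usb/%03d/%03d", bus, dev);
}
```

For example, device 5 on bus 2 maps to /proc/bus/usb/002/005.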

SCSI

SCSI

当 SCSI 设备在内核中被创建或删除时,所有 SCSI 设备都会产生热插拔事件。对于从系统中添加或删除的每个 SCSI 设备,/sbin/hotplug 调用的参数名称和 SUBSYSTEM 环境变量都被设置为 scsi。SCSI 子系统没有添加额外的环境变量,但这里仍然提到它,因为存在一个 SCSI 专用的用户空间脚本,可以确定应该为指定的 SCSI 设备加载哪些 SCSI 驱动程序(磁盘、磁带、通用设备等)。

All SCSI devices create a hotplug event when the SCSI device is created or removed from the kernel. The /sbin/hotplug call has the parameter name and the SUBSYSTEM environment variable set to the value scsi for every SCSI device that is added or removed from the system. There are no additional environment variables added by the SCSI system, but it is mentioned here because there is a SCSI-specific user-space script that can determine what SCSI drivers (disk, tape, generic, etc.) should be loaded for the specified SCSI device.

笔记本电脑扩展坞

Laptop docking stations

如果在正在运行的 Linux 系统中添加或删除支持即插即用的笔记本电脑扩展坞(把笔记本电脑插入扩展坞,或将其取出),就会产生热插拔事件。/sbin/hotplug 调用的参数名称和 SUBSYSTEM 环境变量被设置为 dock。不会设置其他环境变量。

If a Plug-and-Play-supported laptop docking station is added or removed from the running Linux system (by inserting the laptop into the station, or removing it), a hotplug event is created. The /sbin/hotplug call has the parameter name and the SUBSYSTEM environment variable set to the value dock. No other environment variables are set.

S/390 和 z 系列

S/390 and zSeries

在 S/390 架构上,通道总线体系结构支持多种硬件,这些硬件在 Linux 虚拟系统中被添加或删除时都会产生 /sbin/hotplug 事件。这些设备的 /sbin/hotplug 参数名称和 SUBSYSTEM 环境变量都被设置为 dasd。不会设置其他环境变量。

On the S/390 architecture, the channel bus architecture supports a wide range of hardware, all of which generate /sbin/hotplug events when they are added or removed from the Linux virtual system. These devices all have the /sbin/hotplug parameter name and the SUBSYSTEM environment variable set to the value dasd. No other environment variables are set.

使用 /sbin/hotplug

Using /sbin/hotplug

现在,Linux 内核正在为内核中添加和删除的每个设备调用/sbin/hotplug,因此在用户空间中创建了许多非常有用的工具来利用这一点。两个最流行的工具是 Linux Hotplug 脚本和udev

Now that the Linux kernel is calling /sbin/hotplug for every device added and removed from the kernel, a number of very useful tools have been created in user space that take advantage of this. Two of the most popular tools are the Linux Hotplug scripts and udev.

Linux 热插拔脚本

Linux hotplug scripts

Linux 热插拔脚本是 /sbin/hotplug 调用的第一个用户。这些脚本查看内核为描述刚发现的设备而设置的各个环境变量,然后尝试找到与该设备匹配的内核模块。

The Linux hotplug scripts started out as the very first user of the /sbin/hotplug call. These scripts look at the different environment variables that the kernel sets to describe the device that was just discovered and then try to find a kernel module that matches up with that device.

如前所述,当驱动程序使用 MODULE_DEVICE_TABLE 宏时,depmod 程序会获取这些信息,并生成位于 /lib/modules/KERNEL_VERSION/modules.*map 的文件。其中的 * 因驱动程序所支持的总线类型而异。目前,模块映射文件是为支持 PCI、USB、IEEE1394、INPUT、ISAPNP 和 CCW 子系统设备的驱动程序生成的。

As has been described before, when a driver uses the MODULE_DEVICE_TABLE macro, the program depmod takes that information and creates the files located in /lib/modules/KERNEL_VERSION/modules.*map. The * is different, depending on the bus type that the driver supports. Currently, the module map files are generated for drivers that work for devices that support the PCI, USB, IEEE1394, INPUT, ISAPNP, and CCW subsystems.

热插拔脚本使用这些模块映射文本文件来确定尝试加载哪个模块来支持内核最近发现的设备。它们加载所有模块并且不会在第一个匹配处停止,以便让内核找出哪个模块工作得最好。当设备被移除时,这些脚本不会卸载任何模块。如果他们尝试这样做,他们可能会意外关闭也由与被删除设备相同的驱动程序控制的设备。

The hotplug scripts use these module map text files to determine what module to try to load to support the device that was recently discovered by the kernel. They load all modules and do not stop at the first match, in order to let the kernel work out what module works best. These scripts do not unload any modules when devices are removed. If they were to try to do that, they could accidentally shut down devices that were also controlled by the same driver of the device that was removed.
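The matching step itself is a simple scan of the map file. The sketch below checks one modules.pcimap-style line against a hotplug PCI_ID; it is a simplification, since the real scripts also honor the 0xffffffff wildcard entries and the class/class-mask columns, which this version ignores:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Scan one modules.pcimap-style line:
 *   module vendor device subvendor subdevice class class_mask driver_data
 * with the IDs stored as 8-digit hex values such as 0x00008086, and
 * report whether its vendor/device columns match the given IDs. */
static int pcimap_line_matches(const char *line,
                               unsigned int vendor, unsigned int device)
{
    char module[64];
    unsigned int v, d;

    if (sscanf(line, "%63s %x %x", module, &v, &d) != 3)
        return 0;                      /* malformed or comment line */
    return v == vendor && d == device;
}
```

A hotplug script applies this test to every line of the map file and attempts to load each module whose line matches.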

请注意,由于现在 modprobe 程序可以直接从模块中读取 MODULE_DEVICE_TABLE 信息而不再需要模块映射文件,热插拔脚本可能会被简化为 modprobe 程序外面的一层薄薄的包装。

Note, now that the modprobe program can read the MODULE_DEVICE_TABLE information directly from the modules without the need of the module map files, the hotplug scripts might be reduced to a small wrapper around the modprobe program.

udev

udev

在内核中创建统一驱动程序模型的主要原因之一是允许用户空间以动态方式管理/dev树。以前这是通过 devfs 的实现在用户空间中完成的,但由于缺乏活跃的维护者和一些无法修复的核心错误,该代码库已经慢慢腐烂。许多内核开发人员意识到,如果将所有设备信息导出到用户空间,则可以执行 /dev树的所有必要管理。

One of the main reasons for creating the unified driver model in the kernel was to allow user space to manage the /dev tree in a dynamic fashion. This had previously been done in user space with the implementation of devfs, but that code base has slowly rotted away, due to a lack of an active maintainer and some unfixable core bugs. A number of kernel developers realized that if all device information was exported to user space, it could perform all the necessary management of the /dev tree.

devfs 的设计存在一些非常根本的缺陷。它要求修改每个设备驱动程序来支持它,并且要求设备驱动程序指定其在 /dev 树中的名称和位置。它也不能正确处理动态主设备号和次设备号,并且不允许用户空间以简单的方式覆盖设备的命名,这迫使设备命名策略驻留在内核中而不是用户空间中。Linux 内核开发人员非常讨厌在内核中放置策略,而且由于 devfs 的命名策略不遵循 Linux Standard Base 规范,这着实让他们困扰。

devfs has some very fundamental flaws in its design. It requires every device driver to be modified to support it, and it requires that device driver to specify the name and location within the /dev tree where it is placed. It also does not properly handle dynamic major and minor numbers, and it does not allow user space to override the naming of a device in a simple manner, forcing the device naming policy to reside within the kernel and not in user space. Linux kernel developers really hate having policy within the kernel, and since the devfs naming policy does not follow the Linux Standard Base specification, it really bothers them.

随着 Linux 内核开始被安装在大型服务器上,许多用户遇到了如何管理大量设备的问题。由 10,000 多个独立设备组成的磁盘阵列提出了一项非常艰巨的任务:无论特定磁盘被放置在阵列中的什么位置、何时被内核发现,都要确保它始终使用完全相同的名称。同样的问题也困扰着桌面用户:他们把两台 USB 打印机插入系统后才意识到,无法保证系统重新启动后,名为 /dev/lpt0 的打印机不会变成另一台打印机。

As the Linux kernel started to be installed on huge servers, a lot of users ran into the problem of how to manage very large numbers of devices. Disk drive arrays of over 10,000 unique devices presented the very difficult task of ensuring that a specific disk was always named with the same exact name, no matter where it was placed in the disk array or when it was discovered by the kernel. This same problem also plagued desktop users who tried to plug two USB printers into their system and then realized that they had no way of ensuring that the printer known as /dev/lpt0 would not change and be assigned to the other printer if the system was rebooted.

于是,udev就被创建了。它依赖于通过 sysfs 将所有设备信息导出到用户空间,并依赖于 /sbin/hotplug通知设备已添加或删除。策略决策,例如给设备起什么名字,可以在内核之外的用户空间中指定。这确保了命名策略从内核中删除,并允许每个设备的名称具有很大的灵活性。

So, udev was created. It relies on all device information being exported to user space through sysfs and on being notified by /sbin/hotplug that a device was added or removed. Policy decisions, such as what name to give a device, can be specified in user space, outside of the kernel. This ensures that the naming policy is removed from the kernel and allows a large amount of flexibility about the name of each device.
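As an illustration of the kind of user-space policy this enables, an early-udev rules file entry could pin a specific USB printer to a fixed name by matching on its serial number. The file name, serial number, and device name below are made up; consult your distribution's udev documentation for the exact rule syntax it supports:

```
# /etc/udev/rules.d/10-local.rules (hypothetical example)
# Always name this printer lp_epson, whatever order it is probed in.
BUS="usb", SYSFS{serial}="L72010011070626380", NAME="lp_epson"
```

With such a rule in place, the /dev/lpt0-style renumbering problem described above disappears: the node name follows the physical device, not the probe order.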

有关如何使用udev以及如何配置它的更多信息,请参阅 您的发行版中udev包附带的文档。

For more information on how to use udev and how to configure it, please see the documentation that comes included with the udev package in your distribution.

为了让 udev 正常配合,设备驱动程序需要做的只是确保分配给其所控制设备的任何主设备号和次设备号都通过 sysfs 导出到用户空间。对于任何利用子系统来分配主设备号和次设备号的驱动程序,这一点已经由子系统完成,驱动程序不需要做任何工作。这样做的子系统包括 tty、misc、usb、input、scsi、block、i2c、network 和帧缓冲区子系统。如果您的驱动程序通过调用 cdev_init 函数或更老的 register_chrdev 函数自行获取主设备号和次设备号,就需要修改驱动程序,udev 才能正常工作。

All that a device driver needs to do, for udev to work properly with it, is ensure that any major and minor numbers assigned to a device controlled by the driver are exported to user space through sysfs. For any driver that uses a subsystem to assign it a major and minor number, this is already done by the subsystem, and the driver doesn't have to do any work. Examples of subsystems that do this are the tty, misc, usb, input, scsi, block, i2c, network, and frame buffer subsystems. If your driver handles getting a major and minor number on its own, through a call to the cdev_init function or the older register_chrdev function, the driver needs to be modified in order for udev to work properly with it.

udev 在 sysfs 的 /class/ 树中查找一个名为 dev 的文件,以便在内核通过 /sbin/hotplug 接口调用它时,确定分配给特定设备的主设备号和次设备号。设备驱动程序只需为其控制的每个设备创建该文件。class_simple 接口通常是做到这一点的最简单方法。

udev looks for a file called dev in the /class/ tree of sysfs, in order to determine what major and minor number is assigned to a specific device when it is called by the kernel through the /sbin/hotplug interface. A device driver merely needs to create that file for every device it controls. The class_simple interface is usually the easiest way to do this.

正如第 14.5.1 节中提到的,使用 class_simple 接口的第一步是调用 class_simple_create 函数创建一个 struct class_simple:

As mentioned in Section 14.5.1 the first step in using the class_simple interface is to create a struct class_simple with a call to the class_simple_create function:

static struct class_simple *foo_class;
...
foo_class = class_simple_create(THIS_MODULE, "foo");
if (IS_ERR(foo_class)) {
    printk(KERN_ERR "Error creating foo class.\n");
    goto error;
}
static struct class_simple *foo_class;
...
foo_class = class_simple_create(THIS_MODULE, "foo");
if (IS_ERR(foo_class)) {
    printk(KERN_ERR "Error creating foo class.\n");
    goto error;
}

此代码在 sysfs 的 /sys/class/foo中创建一个目录。

This code creates a directory in sysfs in /sys/class/foo.

每当您的驱动程序发现一个新设备,并按照第 3 章中的描述为它分配一个次设备号时,驱动程序就应该调用 class_simple_device_add 函数:

Whenever a new device is found by your driver, and you assign it a minor number as described in Chapter 3, the driver should call the class_simple_device_add function:

class_simple_device_add(foo_class, MKDEV(FOO_MAJOR, minor), NULL, "foo%d", minor);
class_simple_device_add(foo_class, MKDEV(FOO_MAJOR, minor), NULL, "foo%d", minor);

此代码会在 /sys/class/foo 下创建一个名为 fooN 的子目录,其中 N 是该设备的次设备号。该目录中会创建一个名为 dev 的文件,这正是 udev 为您的设备创建设备节点所需要的。

This code causes a subdirectory under /sys/class/foo to be created called fooN, where N is the minor number for this device. There is one file created in this directory, dev, which is exactly what udev needs in order to create a device node for your device.

当您的驱动程序与设备解除绑定、放弃其占用的次设备号时,需要调用 class_simple_device_remove 来删除该设备的 sysfs 条目:

When your driver is unbound from a device, and you give up the minor number that it was attached to, a call to class_simple_device_remove is needed to remove the sysfs entries for this device:

class_simple_device_remove(MKDEV(FOO_MAJOR, minor));
class_simple_device_remove(MKDEV(FOO_MAJOR, minor));

稍后,当整个驱动程序被关闭时,需要调用 class_simple_destroy 来删除您最初通过调用 class_simple_create 创建的类:

Later, when your entire driver is being shut down, a call to class_simple_destroy is needed to remove the class that you created originally with the call to class_simple_create:

class_simple_destroy(foo_class);
class_simple_destroy(foo_class);

通过调用 class_simple_device_add 创建的 dev 文件由主设备号和次设备号组成,二者之间用一个 : 字符分隔。如果您因为想在类目录中为子系统提供其他文件而不打算使用 class_simple 接口,请使用 print_dev_t 函数为特定设备正确地格式化主设备号和次设备号。

The dev file that is created by the call to class_simple_device_add consists of the major and minor numbers, separated by a : character. If your driver does not want to use the class_simple interface because you want to provide other files within the class directory for the subsystem, use the print_dev_t function to properly format the major and minor number for the specific device.
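To make the major:minor format concrete, here is a small user-space sketch (not kernel code; the helper name is ours) that splits the contents of such a dev file, the way a tool like udev must before it can create the device node with mknod:

```c
#include <assert.h>
#include <stdio.h>

/* Parse the "major:minor" string found in a sysfs dev attribute.
 * Returns 0 on success, -1 if the string is not in that format. */
static int parse_dev_attr(const char *buf, unsigned int *major,
                          unsigned int *minor)
{
    if (sscanf(buf, "%u:%u", major, minor) != 2)
        return -1;
    return 0;
}
```

For instance, the dev file for a device registered with major 254 and minor 3 contains 254:3, and parse_dev_attr yields those two numbers.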

处理固件

Dealing with Firmware

作为驱动程序作者,您可能会发现自己遇到这样的设备:必须先向其中下载固件,它才能正常工作。硬件市场许多领域的竞争如此激烈,以至于哪怕为设备的控制固件配一点 EEPROM,其成本也超出了制造商愿意承受的范围。于是固件被放在随硬件一起发售的 CD 上,由操作系统负责把固件传送到设备中。

As a driver author, you may find yourself confronted with a device that must have firmware downloaded into it before it functions properly. The competition in many parts of the hardware market is so intense that even the cost of a bit of EEPROM for the device's controlling firmware is more than the manufacturer is willing to spend. So the firmware is distributed on a CD with the hardware, and the operating system is charged with conveying the firmware to the device itself.

您可能想通过如下声明来解决固件问题:

You may be tempted to solve the firmware problem with a declaration like this:

static char my_firmware[  ] = { 0x34, 0x78, 0xa4, ... };
static char my_firmware[  ] = { 0x34, 0x78, 0xa4, ... };

然而,这种方法几乎肯定是一个错误。把固件编码进驱动程序会使驱动程序代码臃肿,使固件升级变得困难,而且很可能招致许可证问题。供应商不太可能以 GPL 发布固件映像,因此把它与 GPL 许可的代码混在一起通常是一个错误。因此,包含内置固件的驱动程序不太可能被主线内核接受,也不太可能被 Linux 发行商收录。

That approach is almost certainly a mistake, however. Coding firmware into a driver bloats the driver code, makes upgrading the firmware hard, and is very likely to run into licensing problems. It is highly unlikely that the vendor has released the firmware image under the GPL, so mixing it with GPL-licensed code is usually a mistake. For this reason, drivers containing wired-in firmware are unlikely to be accepted into the mainline kernel or included by Linux distributors.

内核固件接口

The Kernel Firmware Interface

正确的解决方案是在需要时从用户空间获取固件。但是,请抵制直接从内核空间打开固件文件的诱惑;那是一个容易出错的操作,而且它把策略(以文件名的形式)放进了内核。相反,正确的做法是使用专门为此目的而创建的固件接口:

The proper solution is to obtain the firmware from user space when you need it. Please resist the temptation to try to open a file containing firmware directly from kernel space, however; that is an error-prone operation, and it puts policy (in the form of a file name) into the kernel. Instead, the correct approach is to use the firmware interface, which was created just for this purpose:

#include <linux/firmware.h>
int request_firmware(const struct firmware **fw, char *name,
                     struct device *device);
#include <linux/firmware.h>
int request_firmware(const struct firmware **fw, char *name,
                     struct device *device);

对 request_firmware 的调用请求用户空间定位固件映像并将其提供给内核;稍后我们会详细了解其工作过程。name 应该标识所需的固件;通常的用法是使用供应商提供的固件文件名,类似 my_firmware.bin 这样的名字很典型。如果固件加载成功,返回值为 0(否则返回通常的错误代码),并且 fw 参数被指向如下结构之一:

A call to request_firmware requests that user space locate and provide a firmware image to the kernel; we look at the details of how it works in a moment. The name should identify the firmware that is desired; the normal usage is the name of the firmware file as provided by the vendor. Something like my_firmware.bin is typical. If the firmware is successfully loaded, the return value is 0 (otherwise the usual error code is returned), and the fw argument is pointed to one of these structures:

struct firmware {
        size_t size;
        u8 *data;
};
struct firmware {
        size_t size;
        u8 *data;
};

该结构包含实际的固件,现在可以将其下载到设备上。请注意,该固件是来自用户空间的未经检查的数据;在将其发送到硬件之前,您应该应用您能想到的所有测试来说服自己这是一个正确的固件映像。设备固件通常包含标识字符串、校验和等;在信任数据之前检查所有内容。

That structure contains the actual firmware, which can now be downloaded to the device. Be aware that this firmware is unchecked data from user space; you should apply any and all tests you can think of to convince yourself that it is a proper firmware image before sending it to the hardware. Device firmware usually contains identification strings, checksums, and so on; check them all before trusting the data.
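The shape of such a sanity check can be sketched in ordinary C. The header layout below — a four-byte identification string followed by a one-byte sum over the payload — is invented for illustration; real devices define their own header formats and checksums:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FW_MAGIC     "EXFW"   /* hypothetical identification string */
#define FW_HDR_SIZE  5        /* magic plus one checksum byte */

/* Return nonzero only if the image carries the expected magic and its
 * payload bytes sum to the checksum stored in the header. */
static int fw_image_ok(const unsigned char *data, size_t size)
{
    unsigned char sum = 0;
    size_t i;

    if (size < FW_HDR_SIZE || memcmp(data, FW_MAGIC, 4) != 0)
        return 0;                     /* truncated, or wrong magic */
    for (i = FW_HDR_SIZE; i < size; i++)
        sum += data[i];               /* byte 4 holds the expected sum */
    return sum == data[4];
}
```

Only after a check of this sort succeeds should the driver begin feeding the image to the hardware.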

将固件发送到设备后,您应该使用以下命令释放内核结构:

After you have sent the firmware to the device, you should release the in-kernel structure with:

void release_firmware(struct firmware *fw);
void release_firmware(struct firmware *fw);

由于 request_firmware 要请求用户空间帮忙,它保证在返回前休眠。如果您的驱动程序在必须请求固件时无法休眠,可以使用异步的替代接口:

Since request_firmware asks user space to help, it is guaranteed to sleep before returning. If your driver is not in a position to sleep when it must ask for firmware, the asynchronous alternative may be used:

int request_firmware_nowait(struct module *module,
                            char *name, struct device *device, void *context,
                            void (*cont)(const struct firmware *fw, void *context));
int request_firmware_nowait(struct module *module, 
                            char *name, struct device *device, void *context,
                            void (*cont)(const struct firmware *fw, void *context));

这里的附加参数是 module(几乎总是 THIS_MODULE)、context(固件子系统不使用的私有数据指针)和 cont。如果一切顺利,request_firmware_nowait 会启动固件加载过程并返回 0。在将来的某个时刻,cont 会带着加载结果被调用。如果固件加载由于某种原因失败,fw 为 NULL。

The additional arguments here are module (which will almost always be THIS_MODULE), context (a private data pointer that is not used by the firmware subsystem), and cont. If all goes well, request_firmware_nowait begins the firmware load process and returns 0. At some future time, cont will be called with the result of the load. If the firmware load fails for some reason, fw is NULL.

工作原理

How It Works

固件子系统与 sysfs 和热插拔机制配合工作。当调用 request_firmware 时,会以您的设备名称在 /sys/class/firmware 下创建一个新目录。该目录包含三个属性:

The firmware subsystem works with sysfs and the hotplug mechanism. When a call is made to request_firmware, a new directory is created under /sys/class/firmware using your device's name. That directory contains three attributes:

loading
loading

加载固件的用户空间进程应将该属性设置为 1。加载过程完成后,应将其设置为 0。向 loading 写入 -1 会中止固件加载过程。

This attribute should be set to one by the user-space process that is loading the firmware. When the load process is complete, it should be set to 0. Writing a value of -1 to loading aborts the firmware loading process.

data
data

data 是接收固件数据本身的二进制属性。设置 loading 之后,用户空间进程应将固件写入该属性。

data is a binary attribute that receives the firmware data itself. After setting loading, the user-space process should write the firmware to this attribute.

device
device

此属性是指向/sys/devices下关联条目的符号链接。

This attribute is a symbolic link to the associated entry under /sys/devices.

一旦创建了这些 sysfs 条目,内核就会为您的设备生成一个热插拔事件。传递给热插拔处理程序的环境中包含变量 FIRMWARE,它被设置为提供给 request_firmware 的名称。处理程序应该找到固件文件,并利用上述属性将其复制进内核。如果找不到该文件,处理程序应将 loading 属性设置为 -1。

Once the sysfs entries have been created, the kernel generates a hotplug event for your device. The environment passed to the hotplug handler includes a variable FIRMWARE, which is set to the name provided to request_firmware. The handler should locate the firmware file, and copy it into the kernel using the attributes provided. If the file cannot be found, the handler should set the loading attribute to -1.
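Putting the three attributes together, the handler's job can be sketched in user-space C. The paths are taken as parameters here; a real agent would build them from the DEVPATH and FIRMWARE environment variables and keep its images in a directory such as /lib/firmware:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write a short string (such as "1", "0", or "-1") to a sysfs attribute. */
static int write_str(const char *path, const char *s)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fputs(s, f);
    return fclose(f);
}

/* Sketch of the user-space side of the firmware protocol: announce the
 * transfer on loading, stream the image into data, then signal success
 * (0) or failure (-1) through loading. */
static int load_firmware(const char *image,
                         const char *loading, const char *data)
{
    FILE *in = fopen(image, "rb");
    FILE *out;
    int c;

    if (!in) {
        write_str(loading, "-1");       /* image not found: abort */
        return -1;
    }
    write_str(loading, "1");            /* start the transfer */
    out = fopen(data, "wb");
    if (!out) {
        fclose(in);
        write_str(loading, "-1");
        return -1;
    }
    while ((c = getc(in)) != EOF)       /* feed the image to the kernel */
        putc(c, out);
    fclose(in);
    fclose(out);
    return write_str(loading, "0");     /* transfer complete */
}
```

The same sequence — loading=1, copy to data, loading=0 — is what any firmware agent must perform, whatever language it is written in.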

如果固件请求在 10 秒内未得到服务,内核将放弃并向驱动程序返回失败状态。该超时期限可以通过 sysfs 属性/sys/class/firmware/timeout进行更改。

If a firmware request is not serviced within 10 seconds, the kernel gives up and returns a failure status to the driver. That time-out period can be changed via the sysfs attribute /sys/class/firmware/timeout.

使用request_firmware接口允许您随驱动程序一起分发设备固件。当正确集成到热插拔机制中时,固件加载子系统允许设备“开箱即用”地工作。这显然是处理问题的最佳方式。

Using the request_firmware interface allows you to distribute the device firmware with your driver. When properly integrated into the hotplug mechanism, the firmware loading subsystem allows devices to simply work "out of the box." It is clearly the best way of handling the problem.

不过,请允许我们再提出一个警告:未经制造商许可,不得分发设备固件。许多制造商在受到礼貌询问时,会同意以合理的条款许可其固件;另一些则可能不那么合作。无论哪种情况,未经许可复制和分发他们的固件都是对版权法的侵犯,也是在自找麻烦。

Please indulge us as we pass on one more warning, however: device firmware should not be distributed without the permission of the manufacturer. Many manufacturers will agree to license their firmware under reasonable terms when asked politely; some others can be less cooperative. Either way, copying and distributing their firmware without permission is a violation of copyright law and an invitation for trouble.

快速参考

Quick Reference

本章介绍了很多功能;以下是对所有内容的快速总结。

Many functions have been introduced in this chapter; here is a quick summary of them all.

Kobjects

Kobjects

#include <linux/kobject.h>
#include <linux/kobject.h>

包含文件包含 kobject、相关结构和函数的定义。

The include file containing definitions for kobjects, related structures, and functions.

void kobject_init(struct kobject *kobj);

int kobject_set_name(struct kobject *kobj, const char *format, ...);
void kobject_init(struct kobject *kobj);

int kobject_set_name(struct kobject *kobj, const char *format, ...);

用于 kobject 初始化的函数。

Functions for kobject initialization.

struct kobject *kobject_get(struct kobject *kobj);

void kobject_put(struct kobject *kobj);
struct kobject *kobject_get(struct kobject *kobj);

void kobject_put(struct kobject *kobj);

管理 kobject 引用计数的函数。

Functions that manage reference counts for kobjects.

struct kobj_type;

struct kobj_type *get_ktype(struct kobject *kobj);
struct kobj_type;

struct kobj_type *get_ktype(struct kobject *kobj);

表示 kobject 所嵌入的结构类型。使用 get_ktype 获取与给定 kobject 关联的 kobj_type。

Represents the type of structure within which a kobject is embedded. Use get_ktype to get the kobj_type associated with a given kobject.

int kobject_add(struct kobject *kobj);

extern int kobject_register(struct kobject *kobj);

void kobject_del(struct kobject *kobj);

void kobject_unregister(struct kobject *kobj);
int kobject_add(struct kobject *kobj);

extern int kobject_register(struct kobject *kobj);

void kobject_del(struct kobject *kobj);

void kobject_unregister(struct kobject *kobj);

kobject_add 将一个 kobject 添加到系统中,处理 kset 成员资格、sysfs 表示和热插拔事件的生成。kobject_register 是一个结合了 kobject_init 和 kobject_add 的便捷函数。使用 kobject_del 删除 kobject,或者使用结合了 kobject_del 和 kobject_put 的 kobject_unregister。

kobject_add adds a kobject to the system, handling kset membership, sysfs representation, and hotplug event generation. kobject_register is a convenience function that combines kobject_init and kobject_add. Use kobject_del to remove a kobject or kobject_unregister, which combines kobject_del and kobject_put.

void kset_init(struct kset *kset);

int kset_add(struct kset *kset);

int kset_register(struct kset *kset);

void kset_unregister(struct kset *kset);
void kset_init(struct kset *kset);

int kset_add(struct kset *kset);

int kset_register(struct kset *kset);

void kset_unregister(struct kset *kset);

kset 的初始化和注册函数。

Initialization and registration functions for ksets.

decl_subsys(name, type, hotplug_ops);
decl_subsys(name, type, hotplug_ops);

一个可以更轻松地声明子系统的宏。

A macro that makes it easier to declare subsystems.

void subsystem_init(struct subsystem *subsys);

int subsystem_register(struct subsystem *subsys);

void subsystem_unregister(struct subsystem *subsys);

struct subsystem *subsys_get(struct subsystem *subsys)

void subsys_put(struct subsystem *subsys);
void subsystem_init(struct subsystem *subsys);

int subsystem_register(struct subsystem *subsys);

void subsystem_unregister(struct subsystem *subsys);

struct subsystem *subsys_get(struct subsystem *subsys)

void subsys_put(struct subsystem *subsys);

对子系统的操作。

Operations on subsystems.

Sysfs 操作

Sysfs Operations

#include <linux/sysfs.h>
#include <linux/sysfs.h>

包含文件包含 sysfs 的声明。

The include file containing declarations for sysfs.

int sysfs_create_file(struct kobject *kobj, struct attribute *attr);

int sysfs_remove_file(struct kobject *kobj, struct attribute *attr);

int sysfs_create_bin_file(struct kobject *kobj, struct bin_attribute *attr);

int sysfs_remove_bin_file(struct kobject *kobj, struct bin_attribute *attr);

int sysfs_create_link(struct kobject *kobj, struct kobject *target, char

*name);

void sysfs_remove_link(struct kobject *kobj, char *name);
int sysfs_create_file(struct kobject *kobj, struct attribute *attr);

int sysfs_remove_file(struct kobject *kobj, struct attribute *attr);

int sysfs_create_bin_file(struct kobject *kobj, struct bin_attribute *attr);

int sysfs_remove_bin_file(struct kobject *kobj, struct bin_attribute *attr);

int sysfs_create_link(struct kobject *kobj, struct kobject *target, char

*name);

void sysfs_remove_link(struct kobject *kobj, char *name);

用于创建和删除与 kobject 关联的属性文件的函数。

Functions for creating and removing attribute files associated with a kobject.

总线、设备和驱动程序

Buses, Devices, and Drivers

int bus_register(struct bus_type *bus);

void bus_unregister(struct bus_type *bus);
int bus_register(struct bus_type *bus);

void bus_unregister(struct bus_type *bus);

在设备模型中执行总线注册和注销的函数。

Functions that perform registration and unregistration of buses in the device model.

int bus_for_each_dev(struct bus_type *bus, struct device *start, void *data,

int (*fn)(struct device *, void *));

int bus_for_each_drv(struct bus_type *bus, struct device_driver *start, void

*data, int (*fn)(struct device_driver *, void *));
int bus_for_each_dev(struct bus_type *bus, struct device *start, void *data,

int (*fn)(struct device *, void *));

int bus_for_each_drv(struct bus_type *bus, struct device_driver *start, void

*data, int (*fn)(struct device_driver *, void *));

分别迭代连接到给定总线的每个设备和驱动程序的函数。

Functions that iterate over each of the devices and drivers, respectively, that are attached to the given bus.

BUS_ATTR(name, mode, show, store);

int bus_create_file(struct bus_type *bus, struct bus_attribute *attr);

void bus_remove_file(struct bus_type *bus, struct bus_attribute *attr);
BUS_ATTR(name, mode, show, store);

int bus_create_file(struct bus_type *bus, struct bus_attribute *attr);

void bus_remove_file(struct bus_type *bus, struct bus_attribute *attr);

BUS_ATTR 宏可用于声明一个 bus_attribute 结构,随后可以用上述两个函数添加和删除它。

The BUS_ATTR macro may be used to declare a bus_attribute structure, which may then be added and removed with the above two functions.

int device_register(struct device *dev);

void device_unregister(struct device *dev);
int device_register(struct device *dev);

void device_unregister(struct device *dev);

处理设备注册的函数。

Functions that handle device registration.

DEVICE_ATTR(name, mode, show, store);

int device_create_file(struct device *device, struct device_attribute *entry);

void device_remove_file(struct device *dev, struct device_attribute *attr);
DEVICE_ATTR(name, mode, show, store);

int device_create_file(struct device *device, struct device_attribute *entry);

void device_remove_file(struct device *dev, struct device_attribute *attr);

处理设备属性的宏和函数。

Macros and functions that deal with device attributes.

int driver_register(struct device_driver *drv);

void driver_unregister(struct device_driver *drv);
int driver_register(struct device_driver *drv);

void driver_unregister(struct device_driver *drv);

注册和取消注册设备驱动程序的函数。

Functions that register and unregister a device driver.

DRIVER_ATTR(name, mode, show, store);

int driver_create_file(struct device_driver *drv, struct driver_attribute

*attr);

void driver_remove_file(struct device_driver *drv, struct driver_attribute

*attr);
DRIVER_ATTR(name, mode, show, store);

int driver_create_file(struct device_driver *drv, struct driver_attribute

*attr);

void driver_remove_file(struct device_driver *drv, struct driver_attribute

*attr);

管理驱动程序属性的宏和函数。

Macros and functions that manage driver attributes.

类

Classes

struct class_simple *class_simple_create(struct module *owner, char *name);

void class_simple_destroy(struct class_simple *cs);

struct class_device *class_simple_device_add(struct class_simple *cs, dev_t

devnum, struct device *device, const char *fmt, ...);

void class_simple_device_remove(dev_t dev);

int class_simple_set_hotplug(struct class_simple *cs, int (*hotplug)(struct

class_device *dev, char **envp, int num_envp, char *buffer, int

buffer_size));
struct class_simple *class_simple_create(struct module *owner, char *name);

void class_simple_destroy(struct class_simple *cs);

struct class_device *class_simple_device_add(struct class_simple *cs, dev_t

devnum, struct device *device, const char *fmt, ...);

void class_simple_device_remove(dev_t dev);

int class_simple_set_hotplug(struct class_simple *cs, int (*hotplug)(struct

class_device *dev, char **envp, int num_envp, char *buffer, int

buffer_size));

实现 class_simple 接口的函数;它们管理只包含一个 dev 属性、几乎别无其他内容的简单类条目。

Functions that implement the class_simple interface; they manage simple class entries containing a dev attribute and little else.

int class_register(struct class *cls);

void class_unregister(struct class *cls);
int class_register(struct class *cls);

void class_unregister(struct class *cls);

类的注册和注销。

Registration and unregistration of classes.

CLASS_ATTR(name, mode, show, store);

int class_create_file(struct class *cls, const struct class_attribute *attr);

void class_remove_file(struct class *cls, const struct class_attribute *attr);
CLASS_ATTR(name, mode, show, store);

int class_create_file(struct class *cls, const struct class_attribute *attr);

void class_remove_file(struct class *cls, const struct class_attribute *attr);

用于处理类属性的常用宏和函数。

The usual macros and functions for dealing with class attributes.

int class_device_register(struct class_device *cd);

void class_device_unregister(struct class_device *cd);

int class_device_rename(struct class_device *cd, char *new_name);

CLASS_DEVICE_ATTR(name, mode, show, store);

int class_device_create_file(struct class_device *cls, const struct class_device_attribute *attr);

void class_device_remove_file(struct class_device *cls, const struct class_device_attribute *attr);

实现类设备接口的函数和宏。

Functions and macros that implement the class device interface.

int class_interface_register(struct class_interface *intf);

void class_interface_unregister(struct class_interface *intf);

将接口添加到类(或删除它)的函数。

Functions that add an interface to a class (or remove it).

固件

Firmware

#include <linux/firmware.h>

int request_firmware(const struct firmware **fw, char *name, struct device *device);

int request_firmware_nowait(struct module *module, char *name, struct device *device, void *context, void (*cont)(const struct firmware *fw, void *context));

void release_firmware(struct firmware *fw);

实现内核固件加载接口的函数。

Functions that implement the kernel firmware-loading interface.




[ 1 ]该总线的逻辑名称当然是“sbus”,但该名称已被真实的物理总线所采用。

[1] The logical name for this bus, of course, would have been "sbus," but that name was already taken by a real, physical bus.

第 15 章内存映射和 DMA

Chapter 15. Memory Mapping and DMA

本章深入研究 Linux 内存管理领域,重点介绍对设备驱动程序编写者有用的技术。许多类型的驱动程序编程都需要对虚拟内存子系统的工作原理有一定的了解;当我们进入一些更复杂和性能关键的子系统时,本章中介绍的材料会不止一次地派上用场。虚拟内存子系统也是核心 Linux 内核中非常有趣的部分,因此值得一看。

This chapter delves into the area of Linux memory management, with an emphasis on techniques that are useful to the device driver writer. Many types of driver programming require some understanding of how the virtual memory subsystem works; the material we cover in this chapter comes in handy more than once as we get into some of the more complex and performance-critical subsystems. The virtual memory subsystem is also a highly interesting part of the core Linux kernel and, therefore, it merits a look.

本章的材料分为三个部分:

The material in this chapter is divided into three sections:

  • 第一个部分介绍了mmap系统调用的实现,它允许将设备内存直接映射到用户进程的地址空间。并非所有设备都需要mmap支持,但对于某些设备来说,映射设备内存可以显着提高性能。

  • The first covers the implementation of the mmap system call, which allows the mapping of device memory directly into a user process's address space. Not all devices require mmap support, but, for some, mapping device memory can yield significant performance improvements.

  • 然后,我们通过讨论直接访问用户空间页面来从另一个方向跨越边界。相对较少的驱动程序需要此功能;在许多情况下,内核会在驱动程序不知情的情况下执行这种映射。但是了解如何将用户空间内存映射到内核(使用 get_user_pages)可能会很有用。

  • We then look at crossing the boundary from the other direction with a discussion of direct access to user-space pages. Relatively few drivers need this capability; in many cases, the kernel performs this sort of mapping without the driver even being aware of it. But an awareness of how to map user-space memory into the kernel (with get_user_pages) can be useful.

  • 最后一部分涵盖 直接内存访问 (DMA) I/O 操作,为外设提供对系统内存的直接访问。

  • The final section covers direct memory access (DMA) I/O operations, which provide peripherals with direct access to system memory.

当然,所有这些技术都需要了解 Linux 内存管理的工作原理,因此我们首先概述该子系统。

Of course, all of these techniques require an understanding of how Linux memory management works, so we start with an overview of that subsystem.

Linux 中的内存管理

Memory Management in Linux

本节并不描述操作系统中的内存管理理论,而是试图指出 Linux 实现的主要特征。尽管您不需要成为 Linux 虚拟内存专家也能实现 mmap,但对事物工作原理的基本概述还是很有用的。下面是对内核用来管理内存的数据结构的相当长的描述。一旦了解了必要的背景,我们就可以开始使用这些结构。

Rather than describing the theory of memory management in operating systems, this section tries to pinpoint the main features of the Linux implementation. Although you do not need to be a Linux virtual memory guru to implement mmap, a basic overview of how things work is useful. What follows is a fairly lengthy description of the data structures used by the kernel to manage memory. Once the necessary background has been covered, we can get into working with these structures.

地址类型

Address Types

Linux 当然是一个虚拟内存系统,这意味着用户程序看到的地址并不直接对应于硬件使用的物理地址。虚拟内存引入了一个间接层,它允许许多好处。通过虚拟内存,系统上运行的程序可以分配比物理可用内存多得多的内存;事实上,即使是单个进程也可以拥有比系统物理内存更大的虚拟地址空间。虚拟内存还允许程序对进程的地址空间进行多种操作,包括将程序的内存映射到设备内存。

Linux is, of course, a virtual memory system, meaning that the addresses seen by user programs do not directly correspond to the physical addresses used by the hardware. Virtual memory introduces a layer of indirection that allows a number of nice things. With virtual memory, programs running on the system can allocate far more memory than is physically available; indeed, even a single process can have a virtual address space larger than the system's physical memory. Virtual memory also allows the program to play a number of tricks with the process's address space, including mapping the program's memory to device memory.

到目前为止,我们已经讨论了虚拟地址和物理地址,但许多细节被掩盖了。Linux 系统处理多种类型的地址,每种类型都有自己的语义。不幸的是,内核代码并不总是非常清楚每种情况下使用的地址类型,因此程序员必须小心。

Thus far, we have talked about virtual and physical addresses, but a number of the details have been glossed over. The Linux system deals with several types of addresses, each with its own semantics. Unfortunately, the kernel code is not always very clear on exactly which type of address is being used in each situation, so the programmer must be careful.

以下是 Linux 中使用的地址类型列表。图 15-1显示了这些地址类型与物理内存的关系。

The following is a list of address types used in Linux. Figure 15-1 shows how these address types relate to physical memory.

用户虚拟地址
User virtual addresses

这些是用户空间程序看到的常规地址。用户地址的长度为 32 位或 64 位,具体取决于底层硬件架构,并且每个进程都有自己的虚拟地址空间。

These are the regular addresses seen by user-space programs. User addresses are either 32 or 64 bits in length, depending on the underlying hardware architecture, and each process has its own virtual address space.

物理地址
Physical addresses

处理器和系统内存之间使用的地址。物理地址是 32 位或 64 位数量;在某些情况下,甚至 32 位系统也可以使用更大的物理地址。

The addresses used between the processor and the system's memory. Physical addresses are 32- or 64-bit quantities; even 32-bit systems can use larger physical addresses in some situations.

总线地址
Bus addresses

外设总线和内存之间使用的地址。通常,它们与处理器使用的物理地址相同,但情况不一定如此。某些架构可以提供 I/O内存管理单元 (IOMMU),用于在总线和主内存之间重新映射地址。IOMMU 可以通过多种方式让生活变得更轻松(例如,使分散在内存中的缓冲区看起来与设备连续),但对 IOMMU 进行编程是设置 DMA 操作时必须执行的额外步骤。当然,总线地址高度依赖于体系结构。

The addresses used between peripheral buses and memory. Often, they are the same as the physical addresses used by the processor, but that is not necessarily the case. Some architectures can provide an I/O memory management unit (IOMMU) that remaps addresses between a bus and main memory. An IOMMU can make life easier in a number of ways (making a buffer scattered in memory appear contiguous to the device, for example), but programming the IOMMU is an extra step that must be performed when setting up DMA operations. Bus addresses are highly architecture dependent, of course.

内核逻辑地址
Kernel logical addresses

这些构成了内核的正常地址空间。这些地址映射主存储器的某些部分(可能是全部),并且通常被视为物理地址。在大多数体系结构中,逻辑地址与其关联的物理地址仅相差一个恒定的偏移量。逻辑地址使用硬件的本机指针大小,因此在配备大量内存的 32 位系统上可能无法寻址所有物理内存。逻辑地址通常存储在 unsigned long 或 void * 类型的变量中。从 kmalloc 返回的内存具有内核逻辑地址。

These make up the normal address space of the kernel. These addresses map some portion (perhaps all) of main memory and are often treated as if they were physical addresses. On most architectures, logical addresses and their associated physical addresses differ only by a constant offset. Logical addresses use the hardware's native pointer size and, therefore, may be unable to address all of physical memory on heavily equipped 32-bit systems. Logical addresses are usually stored in variables of type unsigned long or void *. Memory returned from kmalloc has a kernel logical address.

内核虚拟地址
Kernel virtual addresses

内核虚拟地址与逻辑地址类似,都是从内核空间地址到物理地址的映射。然而,内核虚拟地址不一定具有表征逻辑地址空间的那种线性、一对一的物理地址映射。所有逻辑地址都是内核虚拟地址,但许多内核虚拟地址不是逻辑地址。例如,vmalloc 分配的内存具有虚拟地址(但没有直接的物理映射)。kmap 函数(本章稍后介绍)也返回虚拟地址。虚拟地址通常存储在指针变量中。

Kernel virtual addresses are similar to logical addresses in that they are a mapping from a kernel-space address to a physical address. Kernel virtual addresses do not necessarily have the linear, one-to-one mapping to physical addresses that characterize the logical address space, however. All logical addresses are kernel virtual addresses, but many kernel virtual addresses are not logical addresses. For example, memory allocated by vmalloc has a virtual address (but no direct physical mapping). The kmap function (described later in this chapter) also returns virtual addresses. Virtual addresses are usually stored in pointer variables.

图 15-1。Linux 中使用的地址类型

Figure 15-1. Address types used in Linux

如果有逻辑地址,宏 __pa()(在 <asm/page.h> 中定义)返回其关联的物理地址。可以使用 __va() 将物理地址映射回逻辑地址,但仅限于低端内存页。

If you have a logical address, the macro __pa() (defined in <asm/page.h>) returns its associated physical address. Physical addresses can be mapped back to logical addresses with __va(), but only for low-memory pages.

不同的内核函数需要不同类型的地址。如果定义了不同的 C 类型就好了,这样所需的地址类型是明确的,但我们没有这样的运气。在本章中,我们试图弄清楚在哪里使用哪种类型的地址。

Different kernel functions require different types of addresses. It would be nice if there were different C types defined, so that the required address types were explicit, but we have no such luck. In this chapter, we try to be clear on which types of addresses are used where.

物理地址和页面

Physical Addresses and Pages

物理内存 被分为称为页的离散单元。系统对内存的大部分内部处理都是按页完成的。尽管当前大多数系统使用 4096 字节页面,但页面大小因架构而异。该常量PAGE_SIZE(在<asm/page.h>中定义)给出了任何给定体系结构上的页面大小。

Physical memory is divided into discrete units called pages. Much of the system's internal handling of memory is done on a per-page basis. Page size varies from one architecture to the next, although most systems currently use 4096-byte pages. The constant PAGE_SIZE (defined in <asm/page.h>) gives the page size on any given architecture.

如果查看一个内存地址(虚拟地址或物理地址),它可以分为页号和页内偏移量。例如,如果使用 4096 字节的页面,则 12 个最低有效位是偏移量,其余的高位表示页号。如果丢弃偏移量并将地址的其余部分右移,结果就称为页框号(PFN)。通过移位在页框号和地址之间进行转换是相当常见的操作;宏 PAGE_SHIFT 指示必须移动多少位才能进行此转换。

If you look at a memory address—virtual or physical—it is divisible into a page number and an offset within the page. If 4096-byte pages are being used, for example, the 12 least-significant bits are the offset, and the remaining, higher bits indicate the page number. If you discard the offset and shift the rest of the address to the right, the result is called a page frame number (PFN). Shifting bits to convert between page frame numbers and addresses is a fairly common operation; the macro PAGE_SHIFT tells how many bits must be shifted to make this conversion.

高内存和低内存

High and Low Memory

逻辑地址和内核虚拟地址之间的差异在配备大量内存的 32 位系统上尤为突出。使用 32 位,可以寻址 4 GB 内存。然而,直到最近,由于 32 位系统上的 Linux 设置虚拟地址空间的方式,其内存数量仍受到限制。

The difference between logical and kernel virtual addresses is highlighted on 32-bit systems that are equipped with large amounts of memory. With 32 bits, it is possible to address 4 GB of memory. Linux on 32-bit systems has, until recently, been limited to substantially less memory than that, however, because of the way it sets up the virtual address space.

内核(在 x86 架构上,默认配置)在用户空间和内核之间分割 4 GB 虚拟地址空间;两种上下文中都使用同一组映射。典型的分割将 3 GB 专用于用户空间,1 GB 用于内核空间。[ 1 ]内核的代码和数据结构必须适合该空间,但内核地址空间的最大消耗者是物理内存的虚拟映射。内核无法直接操作未映射到内核地址空间的内存。换句话说,内核需要它自己的虚拟地址来存储它必须直接接触的任何内存。因此,多年来,内核可以处理的最大物理内存量是可以映射到虚拟地址空间的内核部分的量,减去内核代码本身所需的空间。因此,基于 x86 的 Linux 系统最多可以使用略低于 1 GB 的物理内存。

The kernel (on the x86 architecture, in the default configuration) splits the 4-GB virtual address space between user-space and the kernel; the same set of mappings is used in both contexts. A typical split dedicates 3 GB to user space, and 1 GB for kernel space.[1] The kernel's code and data structures must fit into that space, but the biggest consumer of kernel address space is virtual mappings for physical memory. The kernel cannot directly manipulate memory that is not mapped into the kernel's address space. The kernel, in other words, needs its own virtual address for any memory it must touch directly. Thus, for many years, the maximum amount of physical memory that could be handled by the kernel was the amount that could be mapped into the kernel's portion of the virtual address space, minus the space needed for the kernel code itself. As a result, x86-based Linux systems could work with a maximum of a little under 1 GB of physical memory.

为了应对支持更多内存同时又不破坏32位应用程序和系统兼容性的商业压力,处理器制造商在其产品中添加了“地址扩展”功能。结果是,在许多情况下,即使是 32 位处理器也可以寻址超过 4 GB 的物理内存。然而,有多少内存可以直接与逻辑地址映射的限制仍然存在。只有内存的最低部分(最多 1 或 2 GB,具体取决于硬件和内核配置)具有逻辑地址;[ 2 ]其余的(高内存)则不然。在访问特定的高内存页面之前,内核必须设置显式虚拟映射以使该页面在内核地址空间中可用。因此,许多内核数据结构必须放置在低内存中;高内存往往是为用户空间进程页面保留的。

In response to commercial pressure to support more memory while not breaking 32-bit application and the system's compatibility, the processor manufacturers have added "address extension" features to their products. The result is that, in many cases, even 32-bit processors can address more than 4 GB of physical memory. The limitation on how much memory can be directly mapped with logical addresses remains, however. Only the lowest portion of memory (up to 1 or 2 GB, depending on the hardware and the kernel configuration) has logical addresses;[2] the rest (high memory) does not. Before accessing a specific high-memory page, the kernel must set up an explicit virtual mapping to make that page available in the kernel's address space. Thus, many kernel data structures must be placed in low memory; high memory tends to be reserved for user-space process pages.

“高内存”一词可能会让一些人感到困惑,特别是因为它在 PC 世界中还有其他含义。因此,为了清楚起见,我们将在这里定义术语:

The term "high memory" can be confusing to some, especially since it has other meanings in the PC world. So, to make things clear, we'll define the terms here:

低端内存
Low memory

逻辑地址存在于内核空间中的内存。在您可能遇到的几乎每个系统上,所有内存都是低内存。

Memory for which logical addresses exist in kernel space. On almost every system you will likely encounter, all memory is low memory.

高端内存
High memory

不存在逻辑地址的内存,因为它超出了为内核虚拟地址预留的地址范围。

Memory for which logical addresses do not exist, because it is beyond the address range set aside for kernel virtual addresses.

在 i386 系统上,低端内存和高端内存之间的边界通常设置在略低于 1 GB 处,尽管该边界可以在内核配置时更改。此边界与原始 PC 上的旧 640 KB 限制没有任何关系,其位置也不是由硬件决定的。相反,它是内核本身在内核空间和用户空间之间划分 32 位地址空间时设置的限制。

On i386 systems, the boundary between low and high memory is usually set at just under 1 GB, although that boundary can be changed at kernel configuration time. This boundary is not related in any way to the old 640 KB limit found on the original PC, and its placement is not dictated by the hardware. It is, instead, a limit set by the kernel itself as it splits the 32-bit address space between kernel and user space.

在本章中,当我们遇到高端内存的使用限制时,我们会一一指出。

We will point out limitations on the use of high memory as we come to them in this chapter.

内存映射和结构页

The Memory Map and Struct Page

从历史上看,内核使用逻辑地址来引用物理内存页。然而,高端内存支持的加入暴露了这种方法的一个明显问题——高端内存没有可用的逻辑地址。因此,处理内存的内核函数越来越多地改用指向 struct page(在 <linux/mm.h> 中定义)的指针。该数据结构用于跟踪内核需要了解的有关物理内存的几乎所有信息;系统上的每个物理页都对应一个 struct page。该结构的一些字段包括以下内容:

Historically, the kernel has used logical addresses to refer to pages of physical memory. The addition of high-memory support, however, has exposed an obvious problem with that approach—logical addresses are not available for high memory. Therefore, kernel functions that deal with memory are increasingly using pointers to struct page (defined in <linux/mm.h>) instead. This data structure is used to keep track of just about everything the kernel needs to know about physical memory; there is one struct page for each physical page on the system. Some of the fields of this structure include the following:

atomic_t count;

此页面的引用计数。当计数降到 0 时,页面将返回到空闲列表。

The number of references there are to this page. When the count drops to 0, the page is returned to the free list.

void *virtual;

该页的内核虚拟地址(如果已映射);否则为 NULL。低端内存页总是被映射;高端内存页通常不是。该字段并非在所有架构上都存在;通常仅在无法轻松计算页面的内核虚拟地址时才会编译它。如果你想查看这个字段,正确的方法是使用下面介绍的 page_address 宏。

The kernel virtual address of the page, if it is mapped; NULL, otherwise. Low-memory pages are always mapped; high-memory pages usually are not. This field does not appear on all architectures; it generally is compiled only where the kernel virtual address of a page cannot be easily calculated. If you want to look at this field, the proper method is to use the page_address macro, described below.

unsigned long flags;

描述页面状态的一组位标志。其中包括PG_locked,它表明该页面已被锁定在内存中,以及PG_reserved,它根本阻止内存管理系统使用该页面。

A set of bit flags describing the status of the page. These include PG_locked, which indicates that the page has been locked in memory, and PG_reserved, which prevents the memory management system from working with the page at all.

struct page 中还有更多信息,但它们属于内存管理更深层的黑魔法,驱动程序编写者无需关心。

There is much more information within struct page, but it is part of the deeper black magic of memory management and is not of concern to driver writers.

内核维护一个或多个 struct page 条目数组,用于跟踪系统上的所有物理内存。在某些系统上,有一个名为 mem_map 的单一数组。然而,在另一些系统上,情况更为复杂。非均匀内存访问(NUMA)系统和物理内存高度不连续的系统可能有多个内存映射数组,因此可移植的代码应尽可能避免直接访问该数组。幸运的是,通常只需使用 struct page 指针即可,而不必担心它们来自哪里。

The kernel maintains one or more arrays of struct page entries that track all of the physical memory on the system. On some systems, there is a single array called mem_map. On some systems, however, the situation is more complicated. Nonuniform memory access (NUMA) systems and those with widely discontiguous physical memory may have more than one memory map array, so code that is meant to be portable should avoid direct access to the array whenever possible. Fortunately, it is usually quite easy to just work with struct page pointers without worrying about where they come from.

定义了一些函数和宏来在struct page指针和虚拟地址之间进行转换:

Some functions and macros are defined for translating between struct page pointers and virtual addresses:

struct page *virt_to_page(void *kaddr);

该宏在<asm/page.h>中定义,采用内核逻辑地址并返回其关联的struct page指针。由于它需要逻辑地址,因此它不适用于vmalloc或高端内存中的内存。

This macro, defined in <asm/page.h>, takes a kernel logical address and returns its associated struct page pointer. Since it requires a logical address, it does not work with memory from vmalloc or high memory.

struct page *pfn_to_page(int pfn);

返回struct page给定页框号的指针。如有必要,它会在将页帧号 传递给 pfn_to_page之前使用pfn_valid检查其有效性。

Returns the struct page pointer for the given page frame number. If necessary, it checks a page frame number for validity with pfn_valid before passing it to pfn_to_page.

void *page_address(struct page *page);

返回该页的内核虚拟地址(如果存在)。对于高端内存,仅当页面已被映射时该地址才存在。该函数在 <linux/mm.h> 中定义。在大多数情况下,您会希望使用 kmap 的某个版本而不是 page_address。

Returns the kernel virtual address of this page, if such an address exists. For high memory, that address exists only if the page has been mapped. This function is defined in <linux/mm.h>. In most situations, you want to use a version of kmap rather than page_address.

#include <linux/highmem.h>

void *kmap(struct page *page);

void kunmap(struct page *page);

kmap 返回系统中任何页面的内核虚拟地址。对于低端内存页,它只返回该页的逻辑地址;对于高端内存页,kmap 在内核地址空间的专用部分中创建特殊映射。使用 kmap 创建的映射应始终使用 kunmap 释放;可用的此类映射数量有限,因此最好不要持有它们太久。kmap 调用维护一个计数器,因此即使两个或多个函数在同一页面上调用 kmap,也会发生正确的事情。另请注意,如果没有可用的映射,kmap 可能会休眠。

kmap returns a kernel virtual address for any page in the system. For low-memory pages, it just returns the logical address of the page; for high-memory pages, kmap creates a special mapping in a dedicated part of the kernel address space. Mappings created with kmap should always be freed with kunmap; a limited number of such mappings is available, so it is better not to hold on to them for too long. kmap calls maintain a counter, so if two or more functions both call kmap on the same page, the right thing happens. Note also that kmap can sleep if no mappings are available.

#include <linux/highmem.h>

#include <asm/kmap_types.h>

void *kmap_atomic(struct page *page, enum km_type type);

void kunmap_atomic(void *addr, enum km_type type);

kmap_atomic 是 kmap 的高性能形式。每个架构都为原子 kmap 维护一个小的槽列表(专用页表项);kmap_atomic 的调用者必须通过 type 参数告诉系统使用哪个槽。对驱动程序有意义的槽只有 KM_USER0 和 KM_USER1(用于直接从用户空间调用运行的代码),以及 KM_IRQ0 和 KM_IRQ1(用于中断处理程序)。请注意,原子 kmap 必须以原子方式处理;你的代码在持有一个原子 kmap 时不能休眠。另请注意,内核中没有任何机制阻止两个函数尝试使用同一个槽并相互干扰(尽管每个 CPU 都有一组独立的槽)。实际上,原子 kmap 槽的争用似乎不是问题。

kmap_atomic is a high-performance form of kmap. Each architecture maintains a small list of slots (dedicated page table entries) for atomic kmaps; a caller of kmap_atomic must tell the system which of those slots to use in the type argument. The only slots that make sense for drivers are KM_USER0 and KM_USER1 (for code running directly from a call from user space), and KM_IRQ0 and KM_IRQ1 (for interrupt handlers). Note that atomic kmaps must be handled atomically; your code cannot sleep while holding one. Note also that nothing in the kernel keeps two functions from trying to use the same slot and interfering with each other (although there is a unique set of slots for each CPU). In practice, contention for atomic kmap slots seems to not be a problem.

在本章后面以及后续章节的示例代码中,我们会看到这些函数的一些用法。

We see some uses of these functions when we get into the example code, later in this chapter and in subsequent chapters.

页表

Page Tables

在任何现代系统中,处理器都必须具有将虚拟地址转换为相应物理地址的机制。这种机制称为页表;它本质上是一个多级树形结构数组,包含虚拟到物理的映射和一些相关的标志。即使在不直接使用页表的体系结构上,Linux 内核也会维护一组页表。

On any modern system, the processor must have a mechanism for translating virtual addresses into its corresponding physical addresses. This mechanism is called a page table; it is essentially a multilevel tree-structured array containing virtual-to-physical mappings and a few associated flags. The Linux kernel maintains a set of page tables even on architectures that do not use such tables directly.

设备驱动程序通常执行的许多操作可能涉及操作页表。对于驱动程序作者来说幸运的是,2.6 内核已经消除了直接使用页表的任何需要。因此,我们不会详细描述它们;好奇的读者可能想看看Daniel P. Bovet 和 Marco Cesati (O'Reilly) 撰写的Understanding The Linux Kernel来了解完整的故事。

A number of operations commonly performed by device drivers can involve manipulating page tables. Fortunately for the driver author, the 2.6 kernel has eliminated any need to work with page tables directly. As a result, we do not describe them in any detail; curious readers may want to have a look at Understanding The Linux Kernel by Daniel P. Bovet and Marco Cesati (O'Reilly) for the full story.

虚拟内存区域

Virtual Memory Areas

虚拟内存区域(VMA)是用于管理进程地址空间中不同区域的内核数据结构。VMA 表示进程虚拟内存中的一个同质区域:一段连续的虚拟地址范围,它们具有相同的权限标志,并由同一对象(例如一个文件或交换空间)支持。它大致对应于“段”的概念,尽管更恰当的描述是“具有自身属性的内存对象”。进程的内存映射(至少)由以下区域组成:

The virtual memory area (VMA) is the kernel data structure used to manage distinct regions of a process's address space. A VMA represents a homogeneous region in the virtual memory of a process: a contiguous range of virtual addresses that have the same permission flags and are backed up by the same object (a file, say, or swap space). It corresponds loosely to the concept of a "segment," although it is better described as "a memory object with its own properties." The memory map of a process is made up of (at least) the following areas:

  • 程序可执行代码的区域(通常称为文本)

  • An area for the program's executable code (often called text)

  • 多个数据区域,包括初始化数据(在执行开始时具有显式赋值的数据)、未初始化数据(BSS)、[ 3 ]和程序堆栈

  • Multiple areas for data, including initialized data (that which has an explicitly assigned value at the beginning of execution), uninitialized data (BSS),[3] and the program stack

  • 每个活动内存映射一个区域

  • One area for each active memory mapping

通过查看 /proc/<pid>/maps 可以看到进程的内存区域(其中 pid 当然要替换为进程 ID)。/proc/self 是 /proc/<pid> 的特例,因为它始终引用当前进程。作为示例,这里有几个内存映射(我们在其中添加了斜体的简短注释):

The memory areas of a process can be seen by looking in /proc/<pid>/maps (in which pid, of course, is replaced by a process ID). /proc/self is a special case of /proc/<pid>, because it always refers to the current process. As an example, here are a couple of memory maps (to which we have added short comments in italics):

# cat /proc/1/maps                               look at init
08048000-0804e000 r-xp 00000000 03:01 64652      /sbin/init  text
0804e000-0804f000 rw-p 00006000 03:01 64652      /sbin/init  data
0804f000-08053000 rwxp 00000000 00:00 0          zero-mapped BSS
40000000-40015000 r-xp 00000000 03:01 96278      /lib/ld-2.3.2.so  text
40015000-40016000 rw-p 00014000 03:01 96278      /lib/ld-2.3.2.so  data
40016000-40017000 rw-p 00000000 00:00 0          BSS for ld.so
42000000-4212e000 r-xp 00000000 03:01 80290      /lib/tls/libc-2.3.2.so  text
4212e000-42131000 rw-p 0012e000 03:01 80290      /lib/tls/libc-2.3.2.so  data
42131000-42133000 rw-p 00000000 00:00 0          BSS for libc
bffff000-c0000000 rwxp 00000000 00:00 0          Stack segment
ffffe000-fffff000 ---p 00000000 00:00 0          vsyscall page

# rsh wolf cat /proc/self/maps  #### x86-64 (trimmed)
00400000-00405000 r-xp 00000000 03:01 1596291     /bin/cat    text
00504000-00505000 rw-p 00004000 03:01 1596291     /bin/cat    data
00505000-00526000 rwxp 00505000 00:00 0                       bss
3252200000-3252214000 r-xp 00000000 03:01 1237890 /lib64/ld-2.3.3.so
3252300000-3252301000 r--p 00100000 03:01 1237890 /lib64/ld-2.3.3.so
3252301000-3252302000 rw-p 00101000 03:01 1237890 /lib64/ld-2.3.3.so
7fbfffe000-7fc0000000 rw-p 7fbfffe000 00:00 0                 stack
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0       vsyscall

每行中的字段是:

The fields in each line are:

               start-end perm offset major:minor inode image

/proc/*/maps 中的每个字段(图像名称除外)都对应于 struct vm_area_struct 中的一个字段:

Each field in /proc/*/maps (except the image name) corresponds to a field in struct vm_area_struct:

start

end

该内存区域的开始和结束虚拟地址。

The beginning and ending virtual addresses for this memory area.

perm

具有内存区域的读、写和执行权限的位掩码。该字段描述了允许进程对属于该区域的页面执行的操作。该字段中的最后一个字符或者是表示“私有”的 p,或者是表示“共享”的 s。

A bit mask with the memory area's read, write, and execute permissions. This field describes what the process is allowed to do with pages belonging to the area. The last character in the field is either p for "private" or s for "shared."

offset

内存区域在其映射到的文件中开始的位置。偏移量 0表示内存区域的开头对应于文件的开头。

Where the memory area begins in the file that it is mapped to. An offset of 0 means that the beginning of the memory area corresponds to the beginning of the file.

major

minor

保存已映射文件的设备的主设备号和次设备号。令人困惑的是,对于设备映射,主编号和次编号指的是保存用户打开的设备特殊文件的磁盘分区,而不是设备本身。

The major and minor numbers of the device holding the file that has been mapped. Confusingly, for device mappings, the major and minor numbers refer to the disk partition holding the device special file that was opened by the user, and not the device itself.

inode

映射文件的索引节点号。

The inode number of the mapped file.

image

已映射的文件(通常是可执行映像)的名称。

The name of the file (usually an executable image) that has been mapped.

vm_area_struct结构

The vm_area_struct structure

当用户空间进程调用mmap将设备内存映射到其地址空间时,系统会通过创建一个新的 VMA 来表示该映射来做出响应。支持mmap(因此实现 mmap方法)的驱动程序需要通过完成该 VMA 的初始化来帮助该过程。因此,驱动程序编写者应该至少对 VMA 有最低限度的了解才能支持mmap

When a user-space process calls mmap to map device memory into its address space, the system responds by creating a new VMA to represent that mapping. A driver that supports mmap (and, thus, that implements the mmap method) needs to help that process by completing the initialization of that VMA. The driver writer should, therefore, have at least a minimal understanding of VMAs in order to support mmap.

让我们看看 struct vm_area_struct(在 <linux/mm.h> 中定义)中最重要的字段。这些字段可以由设备驱动程序在其 mmap 实现中使用。请注意,内核维护 VMA 的链表和树以优化区域查找,并且 vm_area_struct 的几个字段用于维护这种组织。因此,驱动程序不能随意创建 VMA,否则这些结构就会被破坏。VMA 的主要字段如下(请注意这些字段与我们刚刚看到的 /proc 输出之间的相似性):

Let's look at the most important fields in struct vm_area_struct (defined in <linux/mm.h>). These fields may be used by device drivers in their mmap implementation. Note that the kernel maintains lists and trees of VMAs to optimize area lookup, and several fields of vm_area_struct are used to maintain this organization. Therefore, VMAs can't be created at will by a driver, or the structures break. The main fields of VMAs are as follows (note the similarity between these fields and the /proc output we just saw):

unsigned long vm_start;

unsigned long vm_end;

该VMA覆盖的虚拟地址范围。这些字段是/proc/*/maps中显示的前两个字段。

The virtual address range covered by this VMA. These fields are the first two fields shown in /proc/*/maps.

struct file *vm_file;

指向与该区域关联的结构的指针struct file(如果有)。

A pointer to the struct file structure associated with this area (if any).

unsigned long vm_pgoff;

文件中区域的偏移量(以页为单位)。当映射文件或设备时,这是该区域中映射的第一页的文件位置。

The offset of the area in the file, in pages. When a file or device is mapped, this is the file position of the first page mapped in this area.

unsigned long vm_flags;

描述该区域的一组标志。设备驱动程序编写者最感兴趣的标志是VM_IOVM_RESERVEDVM_IO将 VMA 标记为内存映射 I/O 区域。除此之外,该VM_IO标志还可以防止该区域包含在进程核心转储中。VM_RESERVED告诉内存管理系统不要尝试换出该VMA;它应该在大多数设备映射中设置。

A set of flags describing this area. The flags of the most interest to device driver writers are VM_IO and VM_RESERVED. VM_IO marks a VMA as being a memory-mapped I/O region. Among other things, the VM_IO flag prevents the region from being included in process core dumps. VM_RESERVED tells the memory management system not to attempt to swap out this VMA; it should be set in most device mappings.

struct vm_operations_struct *vm_ops;

内核可以调用来操作该内存区域的一组函数。它的存在表明该内存区域是一个内核“对象”,就像我们在整本书中一直使用的 struct file 一样。

A set of functions that the kernel may invoke to operate on this memory area. Its presence indicates that the memory area is a kernel "object," like the struct file we have been using throughout the book.

void *vm_private_data;

驱动程序可以使用它来存储自己的信息的字段。

A field that may be used by the driver to store its own information.
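The vm_flags field described above is a plain bit mask, and driver mmap methods typically OR the device-related flags into it. A minimal user-space sketch of that idiom follows; the bit values are assumptions for illustration, and real code must use the VM_IO and VM_RESERVED macros from <linux/mm.h>:

```c
#include <assert.h>

/* Assumed values for illustration only; real drivers use the kernel's
 * VM_IO and VM_RESERVED macros rather than hardcoded constants. */
#define DEMO_VM_IO       0x00004000UL
#define DEMO_VM_RESERVED 0x00080000UL

/* OR the device-mapping flags into a VMA's flag word, as a driver's
 * mmap method would do with vma->vm_flags. */
static unsigned long demo_mark_device_vma(unsigned long vm_flags)
{
    return vm_flags | DEMO_VM_IO | DEMO_VM_RESERVED;
}
```

Note that the existing flag bits are preserved; only the two device-related bits are added.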

与 struct vm_area_struct 一样,vm_operations_struct 也是在 <linux/mm.h> 中定义的;它包括下面列出的操作。这些操作是处理进程内存需求所需的唯一操作,并且它们按照声明的顺序列出。本章稍后将实现其中一些功能。

Like struct vm_area_struct, the vm_operations_struct is defined in <linux/mm.h>; it includes the operations listed below. These operations are the only ones needed to handle the process's memory needs, and they are listed in the order they are declared. Later in this chapter, some of these functions are implemented.

void (*open)(struct vm_area_struct *vma);

open 方法由内核调用,以允许实现 VMA 的子系统初始化该区域。每当对 VMA 进行新引用时(例如,当进程分叉时),都会调用此方法。当 VMA 第一次由 mmap 创建时会发生一个例外;在这种情况下,会改为调用驱动程序的 mmap 方法。

The open method is called by the kernel to allow the subsystem implementing the VMA to initialize the area. This method is invoked any time a new reference to the VMA is made (when a process forks, for example). The one exception happens when the VMA is first created by mmap; in this case, the driver's mmap method is called instead.

void (*close)(struct vm_area_struct *vma);

当一个区域被破坏时,内核调用其关闭操作。请注意,没有与 VMA 相关的使用计数;每个使用该区域的进程都会打开和关闭该区域一次。

When an area is destroyed, the kernel calls its close operation. Note that there's no usage count associated with VMAs; the area is opened and closed exactly once by each process that uses it.

struct page *(*nopage)(struct vm_area_struct *vma, unsigned long address, int *type);

当进程尝试访问属于有效 VMA 但当前不在内存中的页面时,将为相关区域调用 nopage 方法(如果已定义)。该方法在(可能)从辅助存储器读入物理页之后,返回指向该页的 struct page 指针。如果没有为该区域定义 nopage 方法,则内核会分配一个空页。

When a process tries to access a page that belongs to a valid VMA, but that is currently not in memory, the nopage method is called (if it is defined) for the related area. The method returns the struct page pointer for the physical page after, perhaps, having read it in from secondary storage. If the nopage method isn't defined for the area, an empty page is allocated by the kernel.

int (*populate)(struct vm_area_struct *vm, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);

此方法允许内核在用户空间访问页面之前将其“预故障”到内存中。通常不需要驱动程序实现 populate 方法。

This method allows the kernel to "prefault" pages into memory before they are accessed by user space. There is generally no need for drivers to implement the populate method.

进程内存映射

The Process Memory Map

内存管理难题的最后一块是进程内存映射结构,它将所有其他数据结构结合在一起。系统中的每个进程(除了一些内核空间辅助线程)都有一个 struct mm_struct(在 <linux/sched.h> 中定义),其中包含进程的虚拟内存区域列表、页表和各种其他内存管理内务信息,以及一个信号量(mmap_sem)和一个自旋锁(page_table_lock)。指向该结构的指针可以在任务结构中找到;在驱动程序需要访问它的极少数情况下,通常的方法是使用 current->mm。注意,内存管理结构可以在进程之间共享;例如,Linux 的线程实现就是以这种方式工作的。

The final piece of the memory management puzzle is the process memory map structure, which holds all of the other data structures together. Each process in the system (with the exception of a few kernel-space helper threads) has a struct mm_struct (defined in <linux/sched.h>) that contains the process's list of virtual memory areas, page tables, and various other bits of memory management housekeeping information, along with a semaphore (mmap_sem) and a spinlock (page_table_lock). The pointer to this structure is found in the task structure; in the rare cases where a driver needs to access it, the usual way is to use current->mm. Note that the memory management structure can be shared between processes; the Linux implementation of threads works in this way, for example.

我们对 Linux 内存管理数据结构的概述到此结束。完成这些之后,我们现在可以继续实现mmap系统调用。

That concludes our overview of Linux memory management data structures. With that out of the way, we can now proceed to the implementation of the mmap system call.

mmap设备操作

The mmap Device Operation

内存映射是现代 Unix 系统最有趣的功能之一。就驱动程序而言,可以实现内存映射,为用户程序提供对设备内存的直接访问。

Memory mapping is one of the most interesting features of modern Unix systems. As far as drivers are concerned, memory mapping can be implemented to provide user programs with direct access to device memory.

通过查看 X Window 系统服务器的虚拟内存区域的子集,可以看到mmap使用的明确示例:

A definitive example of mmap usage can be seen by looking at a subset of the virtual memory areas for the X Window System server:

            cat /proc/731/maps
000a0000-000c0000 rwxs 000a0000 03:01 282652      /dev/mem
000f0000-00100000 r-xs 000f0000 03:01 282652      /dev/mem
00400000-005c0000 r-xp 00000000 03:01 1366927     /usr/X11R6/bin/Xorg
006bf000-006f7000 rw-p 001bf000 03:01 1366927     /usr/X11R6/bin/Xorg
2a95828000-2a958a8000 rw-s fcc00000 03:01 282652  /dev/mem
2a958a8000-2a9d8a8000 rw-s e8000000 03:01 282652  /dev/mem
...

X 服务器的 VMA 完整列表很长,但这里对大多数条目不感兴趣。然而,我们确实看到了 /dev/mem 的四个单独的映射,这让我们可以深入了解 X 服务器如何与显卡一起工作。第一个映射位于 a0000,这是 640-KB ISA 孔中视频 RAM 的标准位置。再往下,我们看到 e8000000 处有一个大映射,该地址位于系统最高 RAM 地址之上。这是适配器上视频内存的直接映射。

The full list of the X server's VMAs is lengthy, but most of the entries are not of interest here. We do see, however, four separate mappings of /dev/mem, which give some insight into how the X server works with the video card. The first mapping is at a0000, which is the standard location for video RAM in the 640-KB ISA hole. Further down, we see a large mapping at e8000000, an address which is above the highest RAM address on the system. This is a direct mapping of the video memory on the adapter.

这些区域也可以在/proc/iomem中看到:

These regions can also be seen in /proc/iomem:

000a0000-000bffff : Video RAM area
000c0000-000ccfff : Video ROM
000d1000-000d1fff : Adapter ROM
000f0000-000fffff : System ROM
d7f00000-f7efffff : PCI Bus #01
  e8000000-efffffff : 0000:01:00.0
fc700000-fccfffff : PCI Bus #01
  fcc00000-fcc0ffff : 0000:01:00.0

映射设备意味着将一系列用户空间地址与设备内存相关联。每当程序在指定的地址范围内读取或写入时,它实际上是在访问设备。在 X 服务器示例中,使用mmap可以快速轻松地访问视频卡的内存。对于像这样的性能关键型应用程序,直接访问会带来很大的不同。

Mapping a device means associating a range of user-space addresses to device memory. Whenever the program reads or writes in the assigned address range, it is actually accessing the device. In the X server example, using mmap allows quick and easy access to the video card's memory. For a performance-critical application like this, direct access makes a large difference.

正如您可能怀疑的那样,并非每个设备都适合 mmap 抽象;例如,对于串行端口和其他面向流的设备来说,它没有任何意义。mmap 的另一个限制是映射以 PAGE_SIZE 为粒度。内核只能在页表级别管理虚拟地址;因此,映射区域必须是 PAGE_SIZE 的倍数,并且必须位于从 PAGE_SIZE 的倍数地址开始的物理内存中。如果区域的大小不是页面大小的倍数,则内核会通过使区域稍大来强制粒度。

As you might suspect, not every device lends itself to the mmap abstraction; it makes no sense, for instance, for serial ports and other stream-oriented devices. Another limitation of mmap is that mapping is PAGE_SIZE grained. The kernel can manage virtual addresses only at the level of page tables; therefore, the mapped area must be a multiple of PAGE_SIZE and must live in physical memory starting at an address that is a multiple of PAGE_SIZE. The kernel forces size granularity by making a region slightly bigger if its size isn't a multiple of the page size.
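The PAGE_SIZE rounding just described can be sketched in ordinary user-space C. This assumes 4096-byte pages; the kernel's PAGE_ALIGN macro performs the same computation with the real page size:

```c
#include <assert.h>

/* Assumed 4096-byte pages for this sketch. */
#define DEMO_PAGE_SIZE 4096UL

/* Round a mapping length up to a multiple of the page size, as the
 * kernel does when a requested region size is not page-aligned. */
static unsigned long demo_page_align(unsigned long len)
{
    return (len + DEMO_PAGE_SIZE - 1) & ~(DEMO_PAGE_SIZE - 1);
}
```

A one-byte request thus consumes a whole page, and a request of exactly one page is left unchanged.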

这些限制对于驱动程序来说并不是一个大的约束,因为访问设备的程序无论如何都是设备相关的。由于程序必须了解设备的工作原理,因此程序员不会因为需要注意页面对齐等细节而过分烦恼。当 ISA 设备在某些非 x86 平台上使用时,存在更大的限制,因为它们的 ISA 硬件视图可能不连续。例如,某些 Alpha 计算机将 ISA 内存视为一组分散的 8 位、16 位或 32 位项,没有直接映射。在这种情况下,你根本无法使用 mmap。无法将 ISA 地址直接映射到 Alpha 地址,是因为两个系统的数据传输规范不兼容。早期的 Alpha 处理器只能发出 32 位和 64 位内存访问,而 ISA 只能执行 8 位和 16 位传输,并且无法将一种协议透明地映射到另一种协议。

These limits are not a big constraint for drivers, because the program accessing the device is device dependent anyway. Since the program must know about how the device works, the programmer is not unduly bothered by the need to see to details like page alignment. A bigger constraint exists when ISA devices are used on some non-x86 platforms, because their hardware view of ISA may not be contiguous. For example, some Alpha computers see ISA memory as a scattered set of 8-bit, 16-bit, or 32-bit items, with no direct mapping. In such cases, you can't use mmap at all. The inability to perform direct mapping of ISA addresses to Alpha addresses is due to the incompatible data transfer specifications of the two systems. Whereas early Alpha processors could issue only 32-bit and 64-bit memory accesses, ISA can do only 8-bit and 16-bit transfers, and there's no way to transparently map one protocol onto the other.

在可行的情况下,使用mmap有明显的优势。例如,我们已经了解了 X 服务器,它在视频内存之间传输大量数据;与lseek / write实现相反,将图形显示映射到用户空间可以显着提高吞吐量 。另一个典型的例子是控制 PCI 设备的程序。大多数 PCI 外设将其控制寄存器映射到内存地址,高性能应用程序可能更喜欢直接访问寄存器,而不是重复调用ioctl来完成其工作。

There are sound advantages to using mmap when it's feasible to do so. For instance, we have already looked at the X server, which transfers a lot of data to and from video memory; mapping the graphic display to user space dramatically improves the throughput, as opposed to an lseek/write implementation. Another typical example is a program controlling a PCI device. Most PCI peripherals map their control registers to a memory address, and a high-performance application might prefer to have direct access to the registers instead of repeatedly having to call ioctl to get its work done.
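From the user-space side, the access pattern these applications rely on looks like the sketch below. Since mapping /dev/mem or PCI registers requires privileges, the demonstration maps an ordinary temporary file instead; the mmap(2) usage is the same call a driver's mmap method would service. The demo_mmap_roundtrip name is invented for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write one page of data to a temporary file, map it with mmap(2),
 * and check that loads through the mapping see the file contents.
 * Returns 0 on success, -1 on any failure. */
static int demo_mmap_roundtrip(void)
{
    char path[] = "/tmp/mmapdemoXXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                 /* file persists until fd is closed */

    char buf[4096];
    memset(buf, 'x', sizeof(buf));
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
        close(fd);
        return -1;
    }

    /* Both the offset (0 here) and the length must be page-aligned. */
    char *p = mmap(NULL, sizeof(buf), PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    int ok = (p[0] == 'x' && p[4095] == 'x');
    munmap(p, sizeof(buf));
    close(fd);
    return ok ? 0 : -1;
}
```

Once the mapping exists, reads are plain memory loads with no further system calls, which is exactly where the throughput advantage over lseek/read cycles comes from.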

mmap 方法是 file_operations 结构的一部分,在发出 mmap 系统调用时被调用。使用 mmap 时,内核在调用实际方法之前会执行大量工作,因此,该方法的原型与系统调用的原型有很大不同。这与 ioctl 和 poll 等调用不同,在这些调用中,内核在调用该方法之前不会执行太多操作。

The mmap method is part of the file_operations structure and is invoked when the mmap system call is issued. With mmap, the kernel performs a good deal of work before the actual method is invoked, and, therefore, the prototype of the method is quite different from that of the system call. This is unlike calls such as ioctl and poll, where the kernel does not do much before calling the method.

系统调用声明如下(如mmap(2) 手册页中所述):

The system call is declared as follows (as described in the mmap(2) manual page):

mmap (caddr_t addr, size_t len, int prot, int flags, int fd, off_t offset)

另一方面,文件操作声明为:

On the other hand, the file operation is declared as:

int (*mmap) (struct file *filp, struct vm_area_struct *vma);

方法中的 filp 参数与第 3 章中介绍的相同,而 vma 包含用于访问设备的虚拟地址范围的信息。因此,大部分工作已经由内核完成;要实现 mmap,驱动程序只需为地址范围构建合适的页表,并在必要时用一组新的操作替换 vma->vm_ops。

The filp argument in the method is the same as that introduced in Chapter 3, while vma contains the information about the virtual address range that is used to access the device. Therefore, much of the work has been done by the kernel; to implement mmap, the driver only has to build suitable page tables for the address range and, if necessary, replace vma->vm_ops with a new set of operations.

构建页表有两种方法:使用名为 remap_pfn_range 的函数一次性完成所有工作,或者通过 nopage VMA 方法一次完成一页。每种方法都有其优点和局限性。我们从更简单的“一次性全部”方法开始。从那里,我们添加现实世界实现所需的复杂性。

There are two ways of building the page tables: doing it all at once with a function called remap_pfn_range or doing it a page at a time via the nopage VMA method. Each method has its advantages and limitations. We start with the "all at once" approach, which is simpler. From there, we add the complications needed for a real-world implementation.

使用 remap_pfn_range

Using remap_pfn_range

为映射物理地址范围而建立新页表的工作由 remap_pfn_range 和 io_remap_page_range 处理,它们具有以下原型:

The job of building new page tables to map a range of physical addresses is handled by remap_pfn_range and io_remap_page_range, which have the following prototypes:

int remap_pfn_range(struct vm_area_struct *vma, 
                     unsigned long virt_addr, unsigned long pfn,
                     unsigned long size, pgprot_t prot);
int io_remap_page_range(struct vm_area_struct *vma, 
                        unsigned long virt_addr, unsigned long phys_addr,
                        unsigned long size, pgprot_t prot);

函数返回的值是通常的 0 或负的错误代码。让我们看看函数参数的确切含义:

The value returned by the function is the usual 0 or a negative error code. Let's look at the exact meaning of the function's arguments:

vma
vma

页范围映射到的虚拟内存区域。

The virtual memory area into which the page range is being mapped.

virt_addr
virt_addr

应开始重新映射的用户虚拟地址。该函数为 virt_addr 和 virt_addr+size 之间的虚拟地址范围构建页表。

The user virtual address where remapping should begin. The function builds page tables for the virtual address range between virt_addr and virt_addr+size.

pfn
pfn

虚拟地址应映射到的物理地址对应的页帧号。页帧号就是物理地址右移 PAGE_SHIFT 位。对于大多数用途,VMA 结构的 vm_pgoff 字段恰好包含您需要的值。该函数影响从 (pfn<<PAGE_SHIFT) 到 (pfn<<PAGE_SHIFT)+size 的物理地址。

The page frame number corresponding to the physical address to which the virtual address should be mapped. The page frame number is simply the physical address right-shifted by PAGE_SHIFT bits. For most uses, the vm_pgoff field of the VMA structure contains exactly the value you need. The function affects physical addresses from (pfn<<PAGE_SHIFT) to (pfn<<PAGE_SHIFT)+size.

size
size

正在重新映射的区域的维度(以字节为单位)。

The dimension, in bytes, of the area being remapped.

prot
prot

为新 VMA 要求的“保护”。驱动程序可以(并且应该)使用在 vma->vm_page_prot 中找到的值。

The "protection" requested for the new VMA. The driver can (and should) use the value found in vma->vm_page_prot.

remap_pfn_range 的参数相当简单,当调用 mmap 方法时,其中大部分已在 VMA 中提供给您。然而,您可能想知道为什么有两个函数。第一个(remap_pfn_range)适用于 pfn 引用实际系统 RAM 的情况,而当 phys_addr 指向 I/O 内存时应使用 io_remap_page_range。实际上,除了 SPARC 之外,这两个函数在每个体系结构上都是相同的,并且您会看到在大多数情况下都使用 remap_pfn_range。但是,为了编写可移植的驱动程序,您应该使用适合您特定情况的 remap_pfn_range 变体。

The arguments to remap_pfn_range are fairly straightforward, and most of them are already provided to you in the VMA when your mmap method is called. You may be wondering why there are two functions, however. The first (remap_pfn_range) is intended for situations where pfn refers to actual system RAM, while io_remap_page_range should be used when phys_addr points to I/O memory. In practice, the two functions are identical on every architecture except the SPARC, and you see remap_pfn_range used in most situations. In the interest of writing portable drivers, however, you should use the variant of remap_pfn_range that is suited to your particular situation.
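The pfn argument's relationship to physical addresses is pure bit arithmetic, sketched below. A PAGE_SHIFT of 12 (4096-byte pages, as on most architectures) is assumed; the real value comes from the architecture headers:

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12  /* assumed; the kernel defines PAGE_SHIFT per arch */

/* Convert a physical address to the page frame number that
 * remap_pfn_range expects. */
static unsigned long demo_phys_to_pfn(unsigned long phys)
{
    return phys >> DEMO_PAGE_SHIFT;
}

/* And back: the first physical address covered by a page frame. */
static unsigned long demo_pfn_to_phys(unsigned long pfn)
{
    return pfn << DEMO_PAGE_SHIFT;
}
```

With these helpers, the X-server mapping at e8000000 seen earlier corresponds to page frame 0xe8000.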

另一种复杂情况与缓存有关:通常,处理器不应缓存对设备内存的引用。通常,系统 BIOS 会正确设置,但也可以通过保护字段禁用特定 VMA 的缓存。不幸的是,在此级别禁用缓存高度依赖于处理器。好奇的读者可能希望查看drivers/char/mem.c中的 pgprot_noncached函数来了解其中涉及的内容。我们不会在这里进一步讨论这个话题。

One other complication has to do with caching: usually, references to device memory should not be cached by the processor. Often the system BIOS sets things up properly, but it is also possible to disable caching of specific VMAs via the protection field. Unfortunately, disabling caching at this level is highly processor dependent. The curious reader may wish to look at the pgprot_noncached function from drivers/char/mem.c to see what's involved. We won't discuss the topic further here.

一个简单的实现

A Simple Implementation

如果您的驱动程序需要将设备内存简单地线性映射到用户地址空间,则remap_pfn_range几乎是您真正需要完成的工作。以下代码源自drivers/char/mem.c ,显示了如何在名为simple (Simple Implement Mapping Pages with Little Enthusiasm)的典型模块中执行此任务:

If your driver needs to do a simple, linear mapping of device memory into a user address space, remap_pfn_range is almost all you really need to do the job. The following code is derived from drivers/char/mem.c and shows how this task is performed in a typical module called simple (Simple Implementation Mapping Pages with Little Enthusiasm):

static int simple_remap_mmap(struct file *filp, struct vm_area_struct *vma)
{
    if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
                vma->vm_end - vma->vm_start,
                vma->vm_page_prot))
        return -EAGAIN;

    vma->vm_ops = &simple_remap_vm_ops;
    simple_vma_open(vma);
    return 0;
}

正如您所看到的,重新映射内存只需调用 remap_pfn_range来创建必要的页表。

As you can see, remapping memory is just a matter of calling remap_pfn_range to create the necessary page tables.

添加VMA操作

Adding VMA Operations

正如我们所看到的,vm_area_struct 结构包含一组可应用于 VMA 的操作。现在我们看看如何以简单的方式提供这些操作。特别是,我们为 VMA 提供 open 和 close 操作。每当进程打开或关闭 VMA 时,就会调用这些操作;特别是,只要进程分叉并创建对 VMA 的新引用,就会调用 open 方法。open 和 close VMA 方法是在内核执行的处理之外被调用的,因此它们不需要重新实现内核在那里完成的任何工作。它们的存在是为了让驱动程序进行可能需要的任何额外处理。

As we have seen, the vm_area_struct structure contains a set of operations that may be applied to the VMA. Now we look at providing those operations in a simple way. In particular, we provide open and close operations for our VMA. These operations are called whenever a process opens or closes the VMA; in particular, the open method is invoked anytime a process forks and creates a new reference to the VMA. The open and close VMA methods are called in addition to the processing performed by the kernel, so they need not reimplement any of the work done there. They exist as a way for drivers to do any additional processing that they may require.

事实证明,简单的驱动程序(例如simple)不需要特别执行任何额外的处理。因此,我们创建了openclose方法,它们将一条消息打印到系统日志,通知全世界它们已被调用。不是特别有用,但它确实允许我们展示如何提供这些方法,并查看它们何时被调用。

As it turns out, a simple driver such as simple need not do any extra processing in particular. So we have created open and close methods, which print a message to the system log informing the world that they have been called. Not particularly useful, but it does allow us to show how these methods can be provided, and see when they are invoked.

为此,我们用调用 printk 的操作覆盖默认的 vma->vm_ops:

To this end, we override the default vma->vm_ops with operations that call printk:

void simple_vma_open(struct vm_area_struct *vma)
{
    printk(KERN_NOTICE "Simple VMA open, virt %lx, phys %lx\n",
            vma->vm_start, vma->vm_pgoff << PAGE_SHIFT);
}

void simple_vma_close(struct vm_area_struct *vma)
{
    printk(KERN_NOTICE "Simple VMA close.\n");
}

static struct vm_operations_struct simple_remap_vm_ops = {
    .open =  simple_vma_open,
    .close = simple_vma_close,
};

为了使这些操作对特定映射生效,需要在相关 VMA 的 vm_ops 字段中存储一个指向 simple_remap_vm_ops 的指针。这通常是在 mmap 方法中完成的。如果您回头看 simple_remap_mmap 示例,您会看到以下代码行:

To make these operations active for a specific mapping, it is necessary to store a pointer to simple_remap_vm_ops in the vm_ops field of the relevant VMA. This is usually done in the mmap method. If you turn back to the simple_remap_mmap example, you see these lines of code:

vma->vm_ops = &simple_remap_vm_ops;
simple_vma_open(vma);

请注意对simple_vma_open 的显式调用。由于 open方法没有在初始 mmap上调用,因此如果我们希望它运行,我们必须显式调用它。

Note the explicit call to simple_vma_open. Since the open method is not invoked on the initial mmap, we must call it explicitly if we want it to run.

使用 nopage 映射内存

Mapping Memory with nopage

虽然 remap_pfn_range 对于许多(如果不是大多数)驱动程序的 mmap 实现都很有效,但有时有必要更灵活一些。在这种情况下,可能需要使用 nopage VMA 方法来实现。

Although remap_pfn_range works well for many, if not most, driver mmap implementations, sometimes it is necessary to be a little more flexible. In such situations, an implementation using the nopage VMA method may be called for.

nopage 方法有用的一种情况是由 mremap 系统调用引起的,应用程序使用该调用来更改映射区域的边界地址。事实上,当 mremap 更改映射的 VMA 时,内核不会直接通知驱动程序。如果 VMA 的大小减小,内核可以悄悄地清除不需要的页面,而无需通知驱动程序。相反,如果扩展了 VMA,当必须为新页面设置映射时,驱动程序最终会通过对 nopage 的调用发现这一点,因此无需执行单独的通知。因此,如果要支持 mremap 系统调用,则必须实现 nopage 方法。在这里,我们展示了 simple 设备的 nopage 的简单实现。

One situation in which the nopage approach is useful can be brought about by the mremap system call, which is used by applications to change the bounding addresses of a mapped region. As it happens, the kernel does not notify drivers directly when a mapped VMA is changed by mremap. If the VMA is reduced in size, the kernel can quietly flush out the unwanted pages without telling the driver. If, instead, the VMA is expanded, the driver eventually finds out by way of calls to nopage when mappings must be set up for the new pages, so there is no need to perform a separate notification. The nopage method, therefore, must be implemented if you want to support the mremap system call. Here, we show a simple implementation of nopage for the simple device.

请记住,nopage 方法具有以下原型:

The nopage method, remember, has the following prototype:

struct page *(*nopage)(struct vm_area_struct *vma, 
                       unsigned long address, int *type);

当用户进程尝试访问内存中不存在的 VMA 中的页面时,将调用关联的 nopage 函数。address 参数包含导致故障的虚拟地址,向下舍入到页的开头。nopage 函数必须定位并返回指向用户想要的页面的 struct page 指针。此函数还必须注意通过调用 get_page 宏来增加它返回的页面的使用计数:

When a user process attempts to access a page in a VMA that is not present in memory, the associated nopage function is called. The address parameter contains the virtual address that caused the fault, rounded down to the beginning of the page. The nopage function must locate and return the struct page pointer that refers to the page the user wanted. This function must also take care to increment the usage count for the page it returns by calling the get_page macro:

get_page(struct page *pageptr);

为了保持映射页面上的引用计数正确,此步骤是必要的。内核为每个页面维护这个计数;当计数达到 0 时,内核知道该页可以被放入空闲列表中。当 VMA 取消映射时,内核会减少该区域中每个页面的使用计数。如果您的驱动程序在向该区域添加页面时不增加计数,则使用计数会过早变为 0,系统的完整性就会受到损害。

This step is necessary to keep the reference counts correct on the mapped pages. The kernel maintains this count for every page; when the count goes to 0, the kernel knows that the page may be placed on the free list. When a VMA is unmapped, the kernel decrements the usage count for every page in the area. If your driver does not increment the count when adding a page to the area, the usage count becomes 0 prematurely, and the integrity of the system is compromised.

nopage 方法还应该将故障类型存储在 type 参数指向的位置中,但前提是该参数不是 NULL。在设备驱动程序中,type 的正确值始终是 VM_FAULT_MINOR。

The nopage method should also store the type of fault in the location pointed to by the type argument—but only if that argument is not NULL. In device drivers, the proper value for type will invariably be VM_FAULT_MINOR.

如果您使用nopage ,则调用mmap时通常只需要做很少的工作;我们的版本如下所示:

If you are using nopage, there is usually very little work to be done when mmap is called; our version looks like this:

static int simple_nopage_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;

    if (offset >= _ _pa(high_memory) || (filp->f_flags & O_SYNC))
        vma->vm_flags |= VM_IO;
    vma->vm_flags |= VM_RESERVED;

    vma->vm_ops = &simple_nopage_vm_ops;
    simple_vma_open(vma);
    return 0;
}

mmap 要做的主要事情就是用我们自己的操作替换默认的(NULL)vm_ops 指针。然后,nopage 方法负责一次“重新映射”一页,并返回其 struct page 结构的地址。因为我们只是在这里实现一个到物理内存的窗口,所以重新映射步骤很简单:我们只需要找到并返回一个指向所需地址的 struct page 的指针。我们的 nopage 方法如下所示:

The main thing mmap has to do is to replace the default (NULL) vm_ops pointer with our own operations. The nopage method then takes care of "remapping" one page at a time and returning the address of its struct page structure. Because we are just implementing a window onto physical memory here, the remapping step is simple: we only need to locate and return a pointer to the struct page for the desired address. Our nopage method looks like the following:

struct page *simple_vma_nopage(struct vm_area_struct *vma,
                unsigned long address, int *type)
{
    struct page *pageptr;
    unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
    unsigned long physaddr = address - vma->vm_start + offset;
    unsigned long pageframe = physaddr >> PAGE_SHIFT;

    if (!pfn_valid(pageframe))
        return NOPAGE_SIGBUS;
    pageptr = pfn_to_page(pageframe);
    get_page(pageptr);
    if (type)
        *type = VM_FAULT_MINOR;
    return pageptr;
}

因为我们在这里再次只是映射主内存,所以 nopage 函数只需要为故障地址找到正确的 struct page 并增加其引用计数。因此,所需的事件顺序是计算所需的物理地址,并通过右移 PAGE_SHIFT 位将其转换为页帧号。由于用户空间可以给我们任何它喜欢的地址,所以我们必须确保我们有一个有效的页帧;pfn_valid 函数为我们做到了这一点。如果地址超出范围,我们返回 NOPAGE_SIGBUS,这会导致总线信号被传递到调用进程。否则,pfn_to_page 获取必要的 struct page 指针;我们可以增加它的引用计数(通过调用 get_page)并返回它。

Since, once again, we are simply mapping main memory here, the nopage function need only find the correct struct page for the faulting address and increment its reference count. Therefore, the required sequence of events is to calculate the desired physical address, and turn it into a page frame number by right-shifting it PAGE_SHIFT bits. Since user space can give us any address it likes, we must ensure that we have a valid page frame; the pfn_valid function does that for us. If the address is out of range, we return NOPAGE_SIGBUS, which causes a bus signal to be delivered to the calling process. Otherwise, pfn_to_page gets the necessary struct page pointer; we can increment its reference count (with a call to get_page) and return it.
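The address arithmetic inside simple_vma_nopage can be replayed in user space. The function below repeats the same three steps (byte offset from vm_pgoff, physical address, page frame number) with made-up test values, again assuming a PAGE_SHIFT of 12:

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12  /* assumed 4096-byte pages */

/* Translate a faulting user virtual address into the page frame number
 * it maps to, given the VMA's start address and mapping offset, exactly
 * as simple_vma_nopage computes it. */
static unsigned long demo_fault_pfn(unsigned long address,
                                    unsigned long vm_start,
                                    unsigned long vm_pgoff)
{
    unsigned long offset = vm_pgoff << DEMO_PAGE_SHIFT;
    unsigned long physaddr = address - vm_start + offset;
    return physaddr >> DEMO_PAGE_SHIFT;
}
```

For example, a fault three pages into a mapping that starts at page frame 0x10 lands on page frame 0x13.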

nopage 方法通常返回一个指向 struct page 的指针。如果由于某种原因无法返回正常页面(例如,请求的地址超出了设备的内存区域),可以返回 NOPAGE_SIGBUS 以发出错误信号;这就是上面的 simple 代码所做的。nopage 还可以返回 NOPAGE_OOM 以指示由于资源限制而导致的失败。

The nopage method normally returns a pointer to a struct page. If, for some reason, a normal page cannot be returned (e.g., the requested address is beyond the device's memory region), NOPAGE_SIGBUS can be returned to signal the error; that is what the simple code above does. nopage can also return NOPAGE_OOM to indicate failures caused by resource limitations.

请注意,此实现适用于 ISA 内存区域,但不适用于 PCI 总线上的内存区域。PCI 内存映射到最高系统内存之上,并且系统内存映射中没有这些地址的条目。因为没有struct page可返回的指针,所以在这些情况下不能使用nopage ;您必须使用remap_pfn_range 代替。

Note that this implementation works for ISA memory regions but not for those on the PCI bus. PCI memory is mapped above the highest system memory, and there are no entries in the system memory map for those addresses. Because there is no struct page to return a pointer to, nopage cannot be used in these situations; you must use remap_pfn_range instead.

如果 nopage 方法保留为 NULL,则处理页面错误的内核代码会将零页映射到故障虚拟地址。零页是一个写时复制页,读取为 0,例如用于映射 BSS 段。任何引用零页的进程看到的正是:一个充满零的页。如果进程写入该页面,它最终会修改一个私有副本。因此,如果进程通过调用 mremap 扩展映射区域,而驱动程序尚未实现 nopage,则进程最终会得到零填充的内存,而不是分段错误。

If the nopage method is left NULL, kernel code that handles page faults maps the zero page to the faulting virtual address. The zero page is a copy-on-write page that reads as 0 and that is used, for example, to map the BSS segment. Any process referencing the zero page sees exactly that: a page filled with zeroes. If the process writes to the page, it ends up modifying a private copy. Therefore, if a process extends a mapped region by calling mremap, and the driver hasn't implemented nopage, the process ends up with zero-filled memory instead of a segmentation fault.

重新映射特定 I/O 区域

Remapping Specific I/O Regions

到目前为止,我们见过的所有例子都是 /dev/mem 的重新实现;它们将物理地址重新映射到用户空间。然而,典型的驱动程序只想映射适用于其外围设备的小地址范围,而不是所有内存。为了仅将整个内存范围的一个子集映射到用户空间,驱动程序只需要处理偏移量。以下代码为映射一个 simple_region_size 字节、从物理地址 simple_region_start(应该是页对齐的)开始的区域的驱动程序完成了这一工作:

All the examples we've seen so far are reimplementations of /dev/mem; they remap physical addresses into user space. The typical driver, however, wants to map only the small address range that applies to its peripheral device, not all memory. In order to map to user space only a subset of the whole memory range, the driver needs only to play with the offsets. The following does the trick for a driver mapping a region of simple_region_size bytes, beginning at physical address simple_region_start (which should be page-aligned):

unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
unsigned long physical = simple_region_start + off;
unsigned long vsize = vma->vm_end - vma->vm_start;
unsigned long psize = simple_region_size - off;

if (vsize > psize)
    return -EINVAL; /*  spans too high */
remap_pfn_range(vma, vma->vm_start, physical, vsize, vma->vm_page_prot);

除了计算偏移量之外,此代码还引入了一项检查,当程序尝试映射比目标设备 I/O 区域中可用的内存更多的内存时,该检查会报告错误。这段代码中,psize是指定偏移量后剩余的物理 I/O 大小,vsize是请求的虚拟内存大小;该函数拒绝映射超出允许内存范围的地址。

In addition to calculating the offsets, this code introduces a check that reports an error when the program tries to map more memory than is available in the I/O region of the target device. In this code, psize is the physical I/O size that is left after the offset has been specified, and vsize is the requested size of virtual memory; the function refuses to map addresses that extend beyond the allowed memory range.

请注意,用户进程始终可以使用mremap来扩展其映射,可能超出物理设备区域的末尾。如果您的驱动程序无法定义 nopage方法,则永远不会通知此扩展,并且附加区域映射到零页。作为驱动程序编写者,您可能很想防止这种行为;将零页映射到区域的末尾并不是一件明显的坏事,但程序员不太可能希望这种情况发生。

Note that the user process can always use mremap to extend its mapping, possibly past the end of the physical device area. If your driver fails to define a nopage method, it is never notified of this extension, and the additional area maps to the zero page. As a driver writer, you may well want to prevent this sort of behavior; mapping the zero page onto the end of your region is not an explicitly bad thing to do, but it is highly unlikely that the programmer wanted that to happen.

防止映射扩展的最简单方法是实现一个简单的 nopage方法,该方法始终会导致将总线信号发送到故障进程。这样的方法看起来像这样:

The simplest way to prevent extension of the mapping is to implement a simple nopage method that always causes a bus signal to be sent to the faulting process. Such a method would look like this:

struct page *simple_nopage(struct vm_area_struct *vma,
                           unsigned long address, int *type)
{ return NOPAGE_SIGBUS; /* send a SIGBUS */}

正如我们所看到的,仅当进程解引用位于已知 VMA 内但当前没有有效页表项的地址时,才会调用 nopage 方法。如果我们使用 remap_pfn_range 映射了整个设备区域,则这里显示的 nopage 方法仅针对该区域之外的引用被调用。因此,它可以安全地返回 NOPAGE_SIGBUS 以发出错误信号。当然,更彻底的 nopage 实现可以检查故障地址是否在设备区域内,如果是,则执行重新映射。然而,nopage 同样不适用于 PCI 内存区域,因此无法扩展 PCI 映射。

As we have seen, the nopage method is called only when the process dereferences an address that is within a known VMA but for which there is currently no valid page table entry. If we have used remap_pfn_range to map the entire device region, the nopage method shown here is called only for references outside of that region. Thus, it can safely return NOPAGE_SIGBUS to signal an error. Of course, a more thorough implementation of nopage could check to see whether the faulting address is within the device area, and perform the remapping if that is the case. Once again, however, nopage does not work with PCI memory areas, so extension of PCI mappings is not possible.

重新映射内存

Remapping RAM

remap_pfn_range 的一个有趣的限制是,它只允许访问保留页以及物理内存顶端之上的物理地址。在 Linux 中,物理地址页在内存映射中被标记为“保留”(reserved),以指示它不可用于内存管理。例如,在 PC 上,640 KB 到 1 MB 之间的范围被标记为保留,承载内核代码本身的那些页面也是如此。保留页被锁定在内存中,并且是唯一可以安全映射到用户空间的页;这个限制是系统稳定性的基本要求。

An interesting limitation of remap_pfn_range is that it gives access only to reserved pages and physical addresses above the top of physical memory. In Linux, a page of physical addresses is marked as "reserved" in the memory map to indicate that it is not available for memory management. On the PC, for example, the range between 640 KB and 1 MB is marked as reserved, as are the pages that host the kernel code itself. Reserved pages are locked in memory and are the only ones that can be safely mapped to user space; this limitation is a basic requirement for system stability.

因此,remap_pfn_range不允许您重新映射常规地址,其中包括通过调用 get_free_page获得的地址。相反,它映射到零页。一切似乎都正常,但进程看到的是私有的、零填充的页面,而不是它所希望的重新映射的 RAM。尽管如此,该函数可以完成大多数硬件驱动程序需要它完成的所有工作,因为它可以重新映射高 PCI 缓冲区和 ISA 内存。

Therefore, remap_pfn_range won't allow you to remap conventional addresses, which include the ones you obtain by calling get_free_page. Instead, it maps in the zero page. Everything appears to work, with the exception that the process sees private, zero-filled pages rather than the remapped RAM that it was hoping for. Nonetheless, the function does everything that most hardware drivers need it to do, because it can remap high PCI buffers and ISA memory.

remap_pfn_range 的限制可以通过运行 mapper 看到,它是 O'Reilly FTP 站点提供的文件中 misc-progs 目录下的示例程序之一。mapper 是一个简单的工具,可以用来快速测试 mmap 系统调用;它映射由命令行选项指定的文件的只读部分,并将映射区域转储到标准输出。例如,下面的会话显示 /dev/mem 没有映射位于地址 64 KB 处的物理页,相反,我们看到的是一个全零的页面(本例中的主机是 PC,但在其他平台上结果也一样):

The limitations of remap_pfn_range can be seen by running mapper, one of the sample programs in misc-progs in the files provided on O'Reilly's FTP site. mapper is a simple tool that can be used to quickly test the mmap system call; it maps read-only parts of a file specified by command-line options and dumps the mapped region to standard output. The following session, for instance, shows that /dev/mem doesn't map the physical page located at address 64 KB—instead, we see a page full of zeros (the host computer in this example is a PC, but the result would be the same on other platforms):

morgana.root# ./mapper /dev/mem 0x10000 0x1000 | od -Ax -t x1
mapped "/dev/mem" from 65536 to 69632
000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
001000

remap_pfn_range无法处理 RAM,这表明像scull这样的基于内存的设备无法轻松实现 mmap,因为它的设备内存是传统 RAM,而不是 I/O 内存。幸运的是,任何需要将 RAM 映射到用户空间的驱动程序都可以使用相对简单的解决方法;它使用我们之前看到的nopage方法。

The inability of remap_pfn_range to deal with RAM suggests that memory-based devices like scull can't easily implement mmap, because their device memory is conventional RAM, not I/O memory. Fortunately, a relatively easy workaround is available to any driver that needs to map RAM into user space; it uses the nopage method that we have seen earlier.

使用 nopage 方法重新映射 RAM

Remapping RAM with the nopage method

将真实 RAM 映射到用户空间的方法是使用 vm_ops->nopage 一次处理一个页面错误。示例实现是第 8 章中介绍的 scullp 模块的一部分。

The way to map real RAM to user space is to use vm_ops->nopage to deal with page faults one at a time. A sample implementation is part of the scullp module, introduced in Chapter 8.

scullp是一个面向页面的字符设备。因为它是面向页的,所以它可以在其内存上实现mmap 。实现内存映射的代码使用了15.1 节中介绍的一些概念。

scullp is a page-oriented char device. Because it is page oriented, it can implement mmap on its memory. The code implementing memory mapping uses some of the concepts introduced in Section 15.1.

在检查代码之前,让我们看看影响 scullp 中 mmap 实现的设计选择:

Before examining the code, let's look at the design choices that affect the mmap implementation in scullp :

  • 只要设备被映射,scullp 就不会释放设备内存。这是一个策略问题而不是硬性要求,它与 scull 及类似设备的行为不同,后者在以写方式打开时会被截断为长度 0。拒绝释放已映射的 scullp 设备,允许一个进程覆盖另一个进程正在主动映射的区域,因此您可以测试并观察进程与设备内存如何交互。为了避免释放已映射的设备,驱动程序必须维护活动映射的计数;设备结构中的 vmas 字段用于此目的。

  • scullp doesn't release device memory as long as the device is mapped. This is a matter of policy rather than a requirement, and it is different from the behavior of scull and similar devices, which are truncated to a length of 0 when opened for writing. Refusing to free a mapped scullp device allows a process to overwrite regions actively mapped by another process, so you can test and see how processes and device memory interact. To avoid releasing a mapped device, the driver must keep a count of active mappings; the vmas field in the device structure is used for this purpose.

  • 仅当 scullp 的 order 参数(在模块加载时设置)为 0 时才执行内存映射。该参数控制如何调用 __get_free_pages(参见第 8.3 节)。零阶限制(强制一次分配一个页面,而不是以更大的组分配)是由 scullp 所使用的分配函数 __get_free_pages 的内部实现决定的。为了最大限度地提高分配性能,Linux 内核为每个分配阶数维护一个空闲页面链表,并且只有页簇中第一个页面的引用计数会被 get_free_pages 递增、被 free_pages 递减。如果分配阶数大于零,则 scullp 设备的 mmap 方法会被禁用,因为 nopage 处理的是单个页面而不是页簇:scullp 根本不知道如何正确管理属于高阶分配的页面的引用计数。(如果您需要回顾 scullp 和内存分配阶数值,请回到第 8.3.1 节。)

  • Memory mapping is performed only when the scullp order parameter (set at module load time) is 0. The parameter controls how _ _get_free_pages is invoked (see Section 8.3). The zero-order limitation (which forces pages to be allocated one at a time, rather than in larger groups) is dictated by the internals of _ _get_free_pages, the allocation function used by scullp. To maximize allocation performance, the Linux kernel maintains a list of free pages for each allocation order, and only the reference count of the first page in a cluster is incremented by get_free_pages and decremented by free_pages. The mmap method is disabled for a scullp device if the allocation order is greater than zero, because nopage deals with single pages rather than clusters of pages. scullp simply does not know how to properly manage reference counts for pages that are part of higher-order allocations. (Return to Section 8.3.1 if you need a refresher on scullp and the memory allocation order value.)

零阶限制主要是为了保持代码简单。通过调整页面的使用计数,可以正确地为多页分配实现 mmap,但这只会增加示例的复杂性,而不会引入任何有趣的信息。

The zero-order limitation is mostly intended to keep the code simple. It is possible to correctly implement mmap for multipage allocations by playing with the usage count of the pages, but it would only add to the complexity of the example without introducing any interesting information.

旨在按照刚刚概述的规则映射 RAM 的代码需要实现 open、close 和 nopage 这几个 VMA 方法;它还需要访问内存映射来调整页面使用计数。

Code that is intended to map RAM according to the rules just outlined needs to implement the open, close, and nopage VMA methods; it also needs to access the memory map to adjust the page usage counts.

scullp_mmap的这个实现非常短,因为它依赖nopage函数来完成所有有趣的工作:

This implementation of scullp_mmap is very short, because it relies on the nopage function to do all the interesting work:

int scullp_mmap(struct file *filp, struct vm_area_struct *vma)
{
    struct inode *inode = filp->f_dentry->d_inode;

    /* refuse to map if order is not 0 */
    if (scullp_devices[iminor(inode)].order)
        return -ENODEV;

    /* don't do anything here: "nopage" will fill the holes */
    vma->vm_ops = &scullp_vm_ops;
    vma->vm_flags |= VM_RESERVED;
    vma->vm_private_data = filp->private_data;
    scullp_vma_open(vma);
    return 0;
}

if 语句的目的是避免映射分配阶数不为 0 的设备。scullp 的操作存储在 vm_ops 字段中,指向设备结构的指针存放在 vm_private_data 字段中。最后,调用 vm_ops->open 来更新设备的活动映射计数。

The purpose of the if statement is to avoid mapping devices whose allocation order is not 0. scullp's operations are stored in the vm_ops field, and a pointer to the device structure is stashed in the vm_private_data field. At the end, vm_ops->open is called to update the count of active mappings for the device.

open 和 close 只是跟踪映射计数,定义如下:

open and close simply keep track of the mapping count and are defined as follows:

void scullp_vma_open(struct vm_area_struct *vma)
{
    struct scullp_dev *dev = vma->vm_private_data;

    dev->vmas++;
}

void scullp_vma_close(struct vm_area_struct *vma)
{
    struct scullp_dev *dev = vma->vm_private_data;

    dev->vmas--;
}

接下来,大部分工作由 nopage 执行。在 scullp 的实现中,传给 nopage 的 address 参数用于计算设备内的偏移量;然后使用该偏移量在 scullp 内存树中查找正确的页面:

Most of the work is then performed by nopage. In the scullp implementation, the address parameter to nopage is used to calculate an offset into the device; the offset is then used to look up the correct page in the scullp memory tree:

struct page *scullp_vma_nopage(struct vm_area_struct *vma,
                                unsigned long address, int *type)
{
    unsigned long offset;
    struct scullp_dev *ptr, *dev = vma->vm_private_data;
    struct page *page = NOPAGE_SIGBUS;
    void *pageptr = NULL; /* default to "missing" */

    down(&dev->sem);
    offset = (address - vma->vm_start) + (vma->vm_pgoff << PAGE_SHIFT);
    if (offset >= dev->size) goto out; /* out of range */

    /*
     * Now retrieve the scullp device from the list, then the page.
     * If the device has holes, the process receives a SIGBUS when
     * accessing the hole.
     */
    offset >>= PAGE_SHIFT; /* offset is a number of pages */
    for (ptr = dev; ptr && offset >= dev->qset;) {
        ptr = ptr->next;
        offset -= dev->qset;
    }
    if (ptr && ptr->data) pageptr = ptr->data[offset];
    if (!pageptr) goto out; /* hole or end-of-file */
    page = virt_to_page(pageptr);
    
    /* got it, now increment the count */
    get_page(page);
    if (type)
        *type = VM_FAULT_MINOR;
  out:
    up(&dev->sem);
    return page;
}

scullp 使用通过 get_free_pages 获得的内存。该内存使用逻辑地址寻址,因此 scullp_nopage 要获取 struct page 指针,只需调用 virt_to_page 即可。

scullp uses memory obtained with get_free_pages. That memory is addressed using logical addresses, so all scullp_nopage has to do to get a struct page pointer is to call virt_to_page.

scullp设备现在可以按预期工作,正如您在映射器实用程序的示例输出中看到的那样。在这里,我们将/dev的目录列表(很长)发送到 scullp设备,然后使用映射器实用程序通过mmap 查看该列表的各个部分:

The scullp device now works as expected, as you can see in this sample output from the mapper utility. Here, we send a directory listing of /dev (which is long) to the scullp device and then use the mapper utility to look at pieces of that listing with mmap:

morgana% ls -l /dev > /dev/scullp
morgana% ./mapper /dev/scullp 0 140
mapped "/dev/scullp" from 0 (0x00000000) to 140 (0x0000008c)
total 232
crw-------    1 root     root      10,  10 Sep 15 07:40 adbmouse
crw-r--r--    1 root     root      10, 175 Sep 15 07:40 agpgart
morgana% ./mapper /dev/scullp 8192 200
mapped "/dev/scullp" from 8192 (0x00002000) to 8392 (0x000020c8)
d0h1494
brw-rw----    1 root     floppy     2,  92 Sep 15 07:40 fd0h1660
brw-rw----    1 root     floppy     2,  20 Sep 15 07:40 fd0h360
brw-rw----    1 root     floppy     2,  12 Sep 15 07:40 fd0H360

重新映射内核虚拟地址

Remapping Kernel Virtual Addresses

虽然很少有必要,但看看驱动程序如何使用 mmap 将内核虚拟地址映射到用户空间还是很有趣的。请记住,真正的内核虚拟地址是由诸如 vmalloc 之类的函数返回的地址,即映射在内核页表中的虚拟地址。本节中的代码取自 scullv,该模块的工作方式类似于 scullp,但通过 vmalloc 分配其存储。

Although it's rarely necessary, it's interesting to see how a driver can map a kernel virtual address to user space using mmap. A true kernel virtual address, remember, is an address returned by a function such as vmalloc—that is, a virtual address mapped in the kernel page tables. The code in this section is taken from scullv, which is the module that works like scullp but allocates its storage through vmalloc.

scullv 的大部分实现与我们刚刚看到的 scullp 类似,只是不需要检查控制内存分配的 order 参数。原因是 vmalloc 一次分配一页,因为单页分配比多页分配更有可能成功。因此,分配阶数问题不适用于 vmalloc 分配的空间。

Most of the scullv implementation is like the one we've just seen for scullp, except that there is no need to check the order parameter that controls memory allocation. The reason for this is that vmalloc allocates its pages one at a time, because single-page allocations are far more likely to succeed than multipage allocations. Therefore, the allocation order problem doesn't apply to vmalloced space.

除此之外,scullpscullv 使用的nopage实现之间只有一个区别。请记住,scullp一旦找到感兴趣的页面,就会通过virt_to_page获得相应的指针。然而,该函数不适用于内核虚拟地址。相反,您必须使用 vmalloc_to_page所以nopagescullv版本的最后部分 如下所示:struct page

Beyond that, there is only one difference between the nopage implementations used by scullp and scullv. Remember that scullp, once it found the page of interest, would obtain the corresponding struct page pointer with virt_to_page. That function does not work with kernel virtual addresses, however. Instead, you must use vmalloc_to_page. So the final part of the scullv version of nopage looks like:

  /*
   * After scullv lookup, "page" is now the address of the page
   * needed by the current process. Since it's a vmalloc address,
   * turn it into a struct page.
   */
  page = vmalloc_to_page(pageptr);
    
  /* got it, now increment the count */
  get_page(page);
  if (type)
      *type = VM_FAULT_MINOR;
out:
  up(&dev->sem);
  return page;

基于这次讨论,您可能还想将 ioremap 返回的地址映射到用户空间。然而,这将是一个错误;来自 ioremap 的地址很特殊,不能像普通的内核虚拟地址一样对待。相反,您应该使用 remap_pfn_range 将 I/O 内存区域重新映射到用户空间。

Based on this discussion, you might also want to map addresses returned by ioremap to user space. That would be a mistake, however; addresses from ioremap are special and cannot be treated like normal kernel virtual addresses. Instead, you should use remap_pfn_range to remap I/O memory areas into user space.
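
To make that contrast concrete, here is a minimal, hypothetical mmap method (not from the book) that does the right thing for I/O memory: it hands the physical region to remap_pfn_range rather than trying to return struct page pointers from nopage. The dev->phys_addr and dev->region_size fields are assumptions standing in for a real device's description:

```c
/* Hypothetical device structure describing an I/O memory window */
struct sketch_dev {
    unsigned long phys_addr;    /* physical base of the region */
    unsigned long region_size;  /* size of the region, in bytes */
};

static int sketch_mmap(struct file *filp, struct vm_area_struct *vma)
{
    struct sketch_dev *dev = filp->private_data;
    unsigned long size = vma->vm_end - vma->vm_start;

    if (size > dev->region_size)
        return -EINVAL;               /* asking for too much */

    /* Map the whole region up front; no nopage method is needed */
    if (remap_pfn_range(vma, vma->vm_start,
                        dev->phys_addr >> PAGE_SHIFT,
                        size, vma->vm_page_prot))
        return -EAGAIN;
    return 0;
}
```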

执行直接 I/O

Performing Direct I/O

大多数 I/O 操作都经由内核缓冲。使用内核空间缓冲区可以在用户空间和实际设备之间提供一定程度的隔离;这种隔离可以使编程更容易,并且在许多情况下还能带来性能优势。然而,在某些情况下,直接对用户空间缓冲区执行 I/O 可能是有益的。如果传输的数据量很大,直接传输数据、省去经由内核空间的额外拷贝,可以加快速度。

Most I/O operations are buffered through the kernel. The use of a kernel-space buffer allows a degree of separation between user space and the actual device; this separation can make programming easier and can also yield performance benefits in many situations. There are cases, however, where it can be beneficial to perform I/O directly to or from a user-space buffer. If the amount of data being transferred is large, transferring data directly without an extra copy through kernel space can speed things up.

2.6 内核中直接 I/O 使用的示例之一是 SCSI 磁带驱动程序。流式磁带可以通过系统传递大量数据,而磁带传输通常是面向记录的,因此在内核中缓冲数据几乎没有什么好处。因此,当条件合适时(例如,用户空间缓冲区是页对齐的),SCSI 磁带驱动程序将执行其 I/O,而不复制数据。

One example of direct I/O use in the 2.6 kernel is the SCSI tape driver. Streaming tapes can pass a lot of data through the system, and tape transfers are usually record-oriented, so there is little benefit to buffering data in the kernel. So, when the conditions are right (the user-space buffer is page-aligned, for example), the SCSI tape driver performs its I/O without copying the data.

也就是说,重要的是要认识到,直接 I/O 并不总能带来人们所期望的性能提升。设置直接 I/O 的开销(涉及调入并锁定相关的用户页面)可能很大,同时还丧失了缓冲 I/O 的优势。例如,使用直接 I/O 要求 write 系统调用以同步方式操作;否则,应用程序无法知道何时可以重用其 I/O 缓冲区。让应用程序停下来等待每次写入完成可能会减慢速度,这就是使用直接 I/O 的应用程序通常也使用异步 I/O 操作的原因。

That said, it is important to recognize that direct I/O does not always provide the performance boost that one might expect. The overhead of setting up direct I/O (which involves faulting in and pinning down the relevant user pages) can be significant, and the benefits of buffered I/O are lost. For example, the use of direct I/O requires that the write system call operate synchronously; otherwise the application does not know when it can reuse its I/O buffer. Stopping the application until each write completes can slow things down, which is why applications that use direct I/O often use asynchronous I/O operations as well.

无论如何,这个故事的真正寓意是:在字符驱动程序中实现直接 I/O 通常没有必要,而且可能有害。只有当您确定缓冲 I/O 的开销确实在拖慢速度时,才应该走这一步。还要注意,块驱动程序和网络驱动程序根本不需要操心直接 I/O 的实现;在这两种情况下,内核中更高层的代码会在需要时建立并使用直接 I/O,驱动程序级代码甚至不需要知道正在执行直接 I/O。

The real moral of the story, in any case, is that implementing direct I/O in a char driver is usually unnecessary and can be hurtful. You should take that step only if you are sure that the overhead of buffered I/O is truly slowing things down. Note also that block and network drivers need not worry about implementing direct I/O at all; in both cases, higher-level code in the kernel sets up and makes use of direct I/O when it is indicated, and driver-level code need not even know that direct I/O is being performed.

在2.6内核中实现直接I/O的关键是一个名为 get_user_pages的函数 ,它在<linux/mm.h>中声明,原型如下:

The key to implementing direct I/O in the 2.6 kernel is a function called get_user_pages , which is declared in <linux/mm.h> with the following prototype:

int get_user_pages(struct task_struct *tsk, 
                   struct mm_struct *mm, 
                   unsigned long start,
                   int len, 
                   int write, 
                   int force, 
                   struct page **pages, 
                   struct vm_area_struct **vmas);

该函数有几个参数:

This function has several arguments:

tsk
tsk

指向执行 I/O 的任务的指针;其主要目的是告诉内核,在设置缓冲区期间发生的任何页面错误应记在谁的账上。该参数几乎总是以 current 传入。

A pointer to the task performing the I/O; its main purpose is to tell the kernel who should be charged for any page faults incurred while setting up the buffer. This argument is almost always passed as current.

mm
mm

指向描述要映射的地址空间的内存管理结构的指针。该mm_struct结构是将进程虚拟地址空间的所有部分 (VMA) 连接在一起的部分。对于驱动程序使用,此参数应始终为current->mm

A pointer to the memory management structure describing the address space to be mapped. The mm_struct structure is the piece that ties together all of the parts (VMAs) of a process's virtual address space. For driver use, this argument should always be current->mm.

start

len
start

len

start 是用户空间缓冲区的(页对齐)地址,len 是以页为单位的缓冲区长度。

start is the (page-aligned) address of the user-space buffer, and len is the length of the buffer in pages.

write

force
write

force

如果 write 非零,则页面被映射为可写访问(当然,这意味着用户空间正在执行读操作)。force 标志告诉 get_user_pages 覆盖给定页面上的保护以提供所请求的访问;驱动程序在这里应该总是传 0。

If write is nonzero, the pages are mapped for write access (implying, of course, that user space is performing a read operation). The force flag tells get_user_pages to override the protections on the given pages to provide the requested access; drivers should always pass 0 here.

pages

vmas
pages

vmas

输出参数。成功完成后,pages 中包含一组指向描述用户空间缓冲区的 struct page 结构的指针,vmas 中包含指向关联 VMA 的指针。显然,这两个参数应该指向至少能容纳 len 个指针的数组。任一参数都可以为 NULL,但要对缓冲区进行实际操作,您至少需要 struct page 指针。

Output parameters. Upon successful completion, pages contain a list of pointers to the struct page structures describing the user-space buffer, and vmas contains pointers to the associated VMAs. The parameters should, obviously, point to arrays capable of holding at least len pointers. Either parameter can be NULL, but you need, at least, the struct page pointers to actually operate on the buffer.

get_user_pages是一个低级内存管理函数,具有适当复杂的接口。它还要求在调用之前以读取模式获取地址空间的 mmap 读写器信号量。因此,对 get_user_pages 的调用通常类似于:

get_user_pages is a low-level memory management function, with a suitably complex interface. It also requires that the mmap reader/writer semaphore for the address space be obtained in read mode before the call. As a result, calls to get_user_pages usually look something like:

down_read(&current->mm->mmap_sem);
result = get_user_pages(current, current->mm, ...);
up_read(&current->mm->mmap_sem);

返回值是实际映射的页面数,该数可能小于请求的数(但大于零)。

The return value is the number of pages actually mapped, which could be fewer than the number requested (but greater than zero).

成功完成后,调用者将拥有一个指向用户空间缓冲区的 pages 数组,该缓冲区已被锁定在内存中。要直接对缓冲区进行操作,内核空间代码必须用 kmap 或 kmap_atomic 将每个 struct page 指针转换为内核虚拟地址。不过,通常情况下,值得使用直接 I/O 的设备都在使用 DMA 操作,因此您的驱动程序可能希望从 struct page 指针数组创建一个分散/聚集(scatter/gather)列表。我们将在第 15.4.4.7 节中讨论如何执行此操作。

Upon successful completion, the caller has a pages array pointing to the user-space buffer, which is locked into memory. To operate on the buffer directly, the kernel-space code must turn each struct page pointer into a kernel virtual address with kmap or kmap_atomic. Usually, however, devices for which direct I/O is justified are using DMA operations, so your driver will probably want to create a scatter/gather list from the array of struct page pointers. We discuss how to do this in the section, Section 15.4.4.7.

一旦直接 I/O 操作完成,您必须释放用户页面。然而,在这样做之前,如果您更改了这些页面的内容,则必须通知内核。否则,内核可能会认为这些页面是“干净的”,这意味着它们与交换设备上找到的副本匹配,并释放它们而不将它们写入后备存储。因此,如果您更改了页面(响应用户空间读取请求),则必须通过调用将每个受影响的页面标记为脏:

Once your direct I/O operation is complete, you must release the user pages. Before doing so, however, you must inform the kernel if you changed the contents of those pages. Otherwise, the kernel may think that the pages are "clean," meaning that they match a copy found on the swap device, and free them without writing them out to backing store. So, if you have changed the pages (in response to a user-space read request), you must mark each affected page dirty with a call to:

void SetPageDirty(struct page *page);

(该宏在<linux/page-flags.h>中定义)。大多数执行此操作的代码首先检查以确保该页面不在内存映射的保留部分中,该部分永远不会被换出。因此,代码通常如下所示:

(This macro is defined in <linux/page-flags.h>). Most code that performs this operation checks first to ensure that the page is not in the reserved part of the memory map, which is never swapped out. Therefore, the code usually looks like:

if (! PageReserved(page))
    SetPageDirty(page);

由于用户空间内存通常不会被标记为保留,因此严格来说这项检查并不是必要的,但是当您深入内存管理子系统时,最好彻底而小心。

Since user-space memory is not normally marked reserved, this check should not strictly be necessary, but when you are getting your hands dirty deep within the memory management subsystem, it is best to be thorough and careful.

无论页面是否已更改,它们都必须从页面缓存中释放,否则将永远保留在那里。使用的调用是:

Regardless of whether the pages have been changed, they must be freed from the page cache, or they stay there forever. The call to use is:

void page_cache_release(struct page *page);

当然,如果需要的话,应该在页面被标记为脏之后进行此调用。

This call should, of course, be made after the page has been marked dirty, if need be.
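
Putting the pieces of this section together, a direct-I/O read typically follows the map/transfer/dirty/release pattern below. This is a hedged sketch against the 2.6-era API, not code from the book; do_device_transfer is a hypothetical stand-in for the actual device I/O, and uaddr is assumed to be page-aligned:

```c
static ssize_t sketch_direct_read(unsigned long uaddr, int npages)
{
    struct page **pages;
    int i, mapped;
    ssize_t retval;

    pages = kmalloc(npages * sizeof(struct page *), GFP_KERNEL);
    if (!pages)
        return -ENOMEM;

    /* Fault in and pin the user pages, holding mmap_sem for reading */
    down_read(&current->mm->mmap_sem);
    mapped = get_user_pages(current, current->mm, uaddr, npages,
                            1 /* write: the device fills the buffer */,
                            0 /* force */, pages, NULL);
    up_read(&current->mm->mmap_sem);
    if (mapped <= 0) {
        kfree(pages);
        return mapped ? mapped : -EFAULT;
    }

    /* Hypothetical helper performing the actual transfer (e.g., DMA) */
    retval = do_device_transfer(pages, mapped);

    /* We wrote into the pages: mark them dirty, then release them */
    for (i = 0; i < mapped; i++) {
        if (!PageReserved(pages[i]))
            SetPageDirty(pages[i]);
        page_cache_release(pages[i]);
    }
    kfree(pages);
    return retval;
}
```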

异步I/O

Asynchronous I/O

2.6 内核中添加的新功能之一是异步 I/O 能力。异步 I/O 允许用户空间启动操作而无需等待操作完成;因此,应用程序可以在 I/O 运行时执行其他处理。复杂的高性能应用程序还可以使用异步 I/O 来同时进行多个操作。

One of the new features added to the 2.6 kernel was the asynchronous I/O capability. Asynchronous I/O allows user space to initiate operations without waiting for their completion; thus, an application can do other processing while its I/O is in flight. A complex, high-performance application can also use asynchronous I/O to have multiple operations going at the same time.

实施异步 I/O 是可选的,很少有驱动程序作者会费心;大多数设备无法从此功能中受益。正如我们将在接下来的章节中看到的,块和网络驱动程序在任何时候都是完全异步的,因此只有字符驱动程序才是显式异步 I/O 支持的候选者。如果有充分的理由在任何给定时间有多个未完成的 I/O 操作,则字符设备可以从这种支持中受益。一个很好的例子是流式磁带驱动器,如果 I/O 操作到达得不够快,驱动器可能会停止运行并显着减慢速度。试图从流驱动器中获得最佳性能的应用程序可以使用异步 I/O 在任何给定时间准备好多个操作。

The implementation of asynchronous I/O is optional, and very few driver authors bother; most devices do not benefit from this capability. As we will see in the coming chapters, block and network drivers are fully asynchronous at all times, so only char drivers are candidates for explicit asynchronous I/O support. A char device can benefit from this support if there are good reasons for having more than one I/O operation outstanding at any given time. One good example is streaming tape drives, where the drive can stall and slow down significantly if I/O operations do not arrive quickly enough. An application trying to get the best performance out of a streaming drive could use asynchronous I/O to have multiple operations ready to go at any given time.

对于需要实现异步 I/O 的罕见驱动程序作者,我们提供了其工作原理的快速概述。我们在本章中介绍异步 I/O,因为它的实现几乎总是涉及直接 I/O 操作(如果您在内核中缓冲数据,通常可以实现异步行为,而不会增加用户空间的复杂性)。

For the rare driver author who needs to implement asynchronous I/O, we present a quick overview of how it works. We cover asynchronous I/O in this chapter, because its implementation almost always involves direct I/O operations as well (if you are buffering data in the kernel, you can usually implement asynchronous behavior without imposing the added complexity on user space).

支持异步 I/O 的驱动程序应包含<linux/aio.h>。 实现异步I/O的file_operations方法有3种:

Drivers supporting asynchronous I/O should include <linux/aio.h>. There are three file_operations methods for the implementation of asynchronous I/O:

ssize_t (*aio_read) (struct kiocb *iocb, char *buffer, 
                     size_t count, loff_t offset);
ssize_t (*aio_write) (struct kiocb *iocb, const char *buffer, 
                      size_t count, loff_t offset);
int (*aio_fsync) (struct kiocb *iocb, int datasync);

aio_fsync 操作只与文件系统代码相关,因此我们在这里不再进一步讨论。另外两个方法 aio_read 和 aio_write 看起来非常像常规的 read 和 write 方法,但有几点不同。一是 offset 参数按值传递;异步操作永远不会更改文件位置,因此没有理由传递指向它的指针。这些方法还接受 iocb(“I/O 控制块”)参数,我们稍后会介绍。

The aio_fsync operation is only of interest to filesystem code, so we do not discuss it further here. The other two, aio_read and aio_write, look very much like the regular read and write methods but with a couple of exceptions. One is that the offset parameter is passed by value; asynchronous operations never change the file position, so there is no reason to pass a pointer to it. These methods also take the iocb ("I/O control block") parameter, which we get to in a moment.

aio_readaio_write方法的目的 是启动读取或写入操作,该操作在它们返回时可能已完成,也可能尚未完成。如果可以立即完成操作,则该方法应该这样做并返回通常的状态:传输的字节数或负错误代码。因此,如果您的驱动程序有一个名为my_read的读取 方法,则以下aio_read 方法是完全正确的(尽管毫无意义):

The purpose of the aio_read and aio_write methods is to initiate a read or write operation that may or may not be complete by the time they return. If it is possible to complete the operation immediately, the method should do so and return the usual status: the number of bytes transferred or a negative error code. Thus, if your driver has a read method called my_read, the following aio_read method is entirely correct (though rather pointless):

static ssize_t my_aio_read(struct kiocb *iocb, char *buffer, 
                           ssize_t count, loff_t offset)
{
    return my_read(iocb->ki_filp, buffer, count, &offset);
}

请注意,struct file 指针位于 kiocb 结构的 ki_filp 字段中。

Note that the struct file pointer is found in the ki_filp field of the kiocb structure.

如果您支持异步 I/O,则必须意识到内核有时会创建“同步 IOCB”。这些本质上是必须以同步方式实际执行的异步操作。人们可能很想知道为什么要这样做,但最好还是按照内核的要求去做。同步操作在 IOCB 中有标记;您的驱动程序应使用以下方式查询该状态:

If you support asynchronous I/O, you must be aware of the fact that the kernel can, on occasion, create "synchronous IOCBs." These are, essentially, asynchronous operations that must actually be executed synchronously. One may well wonder why things are done this way, but it's best to just do what the kernel asks. Synchronous operations are marked in the IOCB; your driver should query that status with:

int is_sync_kiocb(struct kiocb *iocb);

如果此函数返回非零值,则您的驱动程序必须同步执行该操作。

If this function returns a nonzero value, your driver must execute the operation synchronously.

然而,归根结底,所有这些结构的目的都是为了支持异步操作。如果您的驱动程序能够发起该操作(或者简单地说,将其排队等待将来某个时刻执行),它必须做两件事:记住它需要了解的有关该操作的所有信息,并向调用者返回 -EIOCBQUEUED。记住操作信息包括安排对用户空间缓冲区的访问;一旦返回,您就不会再有机会在调用进程的上下文中访问该缓冲区。一般来说,这意味着您可能必须建立直接的内核映射(使用 get_user_pages)或 DMA 映射。-EIOCBQUEUED 错误代码表示操作尚未完成,其最终状态将在稍后发布。

In the end, however, the point of all this structure is to enable asynchronous operations. If your driver is able to initiate the operation (or, simply, to queue it until some future time when it can be executed), it must do two things: remember everything it needs to know about the operation, and return -EIOCBQUEUED to the caller. Remembering the operation information includes arranging access to the user-space buffer; once you return, you will not again have the opportunity to access that buffer while running in the context of the calling process. In general, that means you will likely have to set up a direct kernel mapping (with get_user_pages) or a DMA mapping. The -EIOCBQUEUED error code indicates that the operation is not yet complete, and its final status will be posted later.

当“稍后”到来时,您的驱动程序必须通知内核操作已完成。这是通过调用aio_complete来完成的:

When "later" comes, your driver must inform the kernel that the operation has completed. That is done with a call to aio_complete:

int aio_complete(struct kiocb *iocb, long res, long res2);

这里,iocb 是最初传递给您的同一个 IOCB,res 是该操作通常的结果状态。res2 是将返回到用户空间的第二个结果代码;大多数异步 I/O 实现都将 res2 作为 0 传递。一旦调用了 aio_complete,您就不应再触碰该 IOCB 或用户缓冲区。

Here, iocb is the same IOCB that was initially passed to you, and res is the usual result status for the operation. res2 is a second result code that will be returned to user space; most asynchronous I/O implementations pass res2 as 0. Once you call aio_complete, you should not touch the IOCB or user buffer again.

异步 I/O 示例

An asynchronous I/O example

示例源代码中面向页面的scullp驱动程序实现了异步 I/O。实现很简单,但足以展示异步操作应该如何构造。

The page-oriented scullp driver in the example source implements asynchronous I/O. The implementation is simple, but it is enough to show how asynchronous operations should be structured.

aio_read和aio_write方法实际上没有做太多事情:

The aio_read and aio_write methods don't actually do much:

static ssize_t scullp_aio_read(struct kiocb *iocb, char *buf, size_t count,
        loff_t pos)
{
    return scullp_defer_op(0, iocb, buf, count, pos);
}

static ssize_t scullp_aio_write(struct kiocb *iocb, const char *buf,
        size_t count, loff_t pos)
{
    return scullp_defer_op(1, iocb, (char *) buf, count, pos);
}

这些方法只是调用一个公共函数:

These methods simply call a common function:

struct async_work {
    struct kiocb *iocb;
    int result;
    struct work_struct work;
};

static int scullp_defer_op(int write, struct kiocb *iocb, char *buf,
        size_t count, loff_t pos)
{
    struct async_work *stuff;
    int result;

    /* Copy now while we can access the buffer */
    if (write)
        result = scullp_write(iocb->ki_filp, buf, count, &pos);
    else
        result = scullp_read(iocb->ki_filp, buf, count, &pos);

    /* If this is a synchronous IOCB, we return our status now. */
    if (is_sync_kiocb(iocb))
        return result;

    /* Otherwise defer the completion for a few milliseconds. */
    stuff = kmalloc (sizeof (*stuff), GFP_KERNEL);
    if (stuff == NULL)
        return result; /* No memory, just complete now */
    stuff->iocb = iocb;
    stuff->result = result;
    INIT_WORK(&stuff->work, scullp_do_deferred_op, stuff);
    schedule_delayed_work(&stuff->work, HZ/100);
    return -EIOCBQUEUED;
}

更完整的实现将使用get_user_pages将用户缓冲区映射到内核空间。我们选择通过从一开始就复制数据来保持简单。然后调用is_sync_kiocb看这个操作是否必须同步完成;如果是,则返回结果状态,我们就完成了。否则,我们会在一个小结构中记住相关信息,通过工作队列安排“完成”,然后返回-EIOCBQUEUED。此时,控制权返回到用户空间。

A more complete implementation would use get_user_pages to map the user buffer into kernel space. We chose to keep life simple by just copying over the data at the outset. Then a call is made to is_sync_kiocb to see if this operation must be completed synchronously; if so, the result status is returned, and we are done. Otherwise we remember the relevant information in a little structure, arrange for "completion" via a workqueue, and return -EIOCBQUEUED. At this point, control returns to user space.

随后,工作队列执行我们的完成函数:

Later on, the workqueue executes our completion function:

static void scullp_do_deferred_op(void *p)
{
    struct async_work *stuff = (struct async_work *) p;
    aio_complete(stuff->iocb, stuff->result, 0);
    kfree(stuff);
}

在这里,只需使用我们保存的信息调用 aio_complete 即可。当然,真正的驱动程序的异步 I/O 实现要复杂一些,但它遵循的正是这种结构。

Here, it is simply a matter of calling aio_complete with our saved information. A real driver's asynchronous I/O implementation is somewhat more complicated, of course, but it follows this sort of structure.

直接内存访问

Direct Memory Access

直接内存访问,或 DMA ,是完成我们对内存问题的概述的高级主题。DMA 是允许外围组件直接将其 I/O 数据传输至主存储器或从主存储器传输出其 I/O 数据的硬件机制,而无需涉及系统处理器。使用这种机制可以大大增加设备的吞吐量,因为消除了大量的计算开销。

Direct memory access, or DMA , is the advanced topic that completes our overview of memory issues. DMA is the hardware mechanism that allows peripheral components to transfer their I/O data directly to and from main memory without the need to involve the system processor. Use of this mechanism can greatly increase throughput to and from a device, because a great deal of computational overhead is eliminated.

DMA 数据传输概述

Overview of a DMA Data Transfer

在介绍编程细节之前,我们先回顾一下 DMA 传输是如何发生的,为了简化讨论,仅考虑输入传输。

Before introducing the programming details, let's review how a DMA transfer takes place, considering only input transfers to simplify the discussion.

数据传输可以通过两种方式触发:软件请求数据(通过read等函数)或硬件异步将数据推送到系统。

Data transfer can be triggered in two ways: either the software asks for data (via a function such as read) or the hardware asynchronously pushes data to the system.

对于第一种情况,所涉及的步骤可以总结如下:

In the first case, the steps involved can be summarized as follows:

  1. 当进程调用read时,驱动程序方法会分配 DMA 缓冲区并指示硬件将其数据传输到该缓冲区中。该进程被置于睡眠状态。

  1. When a process calls read, the driver method allocates a DMA buffer and instructs the hardware to transfer its data into that buffer. The process is put to sleep.

  2. 硬件将数据写入 DMA 缓冲区并在完成后引发中断。

  2. The hardware writes data to the DMA buffer and raises an interrupt when it's done.

  3. 中断处理程序获取输入数据,确认中断并唤醒进程,该进程现在可以读取数据。

  3. The interrupt handler gets the input data, acknowledges the interrupt, and awakens the process, which is now able to read data.
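
上述步骤可以用如下的驱动程序骨架来示意。这只是一个假设性的草图:mydev_* 的名称和字段均为虚构,错误处理从简。

The steps above can be sketched as a driver skeleton. This is a hypothetical sketch only: the mydev_* names and fields are invented, and error handling is abbreviated.

```c
/* Hypothetical read method for the first case above. */
static ssize_t mydev_read(struct file *filp, char __user *buf,
                          size_t count, loff_t *pos)
{
    struct mydev *md = filp->private_data;

    /* Step 1: get a DMA buffer, point the hardware at it, and sleep. */
    mydev_setup_dma_in(md, count);
    wait_event_interruptible(md->wq, md->dma_done);

    /* Step 2 happened in hardware; step 3 ran in the interrupt handler,
     * which acknowledged the interrupt, set md->dma_done, and woke us. */
    if (copy_to_user(buf, md->dma_buf, md->dma_count))
        return -EFAULT;
    return md->dma_count;
}
```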

第二种情况发生在异步使用 DMA 时。例如,即使没有人读取数据,数据采集设备也会继续推送数据,就会发生这种情况。在这种情况下,驱动程序应该维护一个缓冲区,以便后续的读取调用将所有累积的数据返回到用户空间。这种传输涉及的步骤略有不同:

The second case comes about when DMA is used asynchronously. This happens, for example, with data acquisition devices that go on pushing data even if nobody is reading them. In this case, the driver should maintain a buffer so that a subsequent read call will return all the accumulated data to user space. The steps involved in this kind of transfer are slightly different:

  1. 硬件发出中断以宣布新数据已到达。

  1. The hardware raises an interrupt to announce that new data has arrived.

  2. 中断处理程序分配一个缓冲区并告诉硬件将其数据传输到哪里。

  2. The interrupt handler allocates a buffer and tells the hardware where to transfer its data.

  3. 外围设备将数据写入缓冲区,并在完成后引发另一个中断。

  3. The peripheral device writes the data to the buffer and raises another interrupt when it's done.

  4. 处理程序调度新数据,唤醒任何相关进程,并负责内务管理。

  4. The handler dispatches the new data, wakes any relevant process, and takes care of housekeeping.

异步方法的一种变体常见于网卡。这些卡片通常期望看到在与处理器共享的内存中建立的循环缓冲区(通常称为DMA 环形缓冲区);每个传入数据包都被放置在环中的下一个可用缓冲区中,并发出中断信号。然后,驱动程序将网络数据包传递到内核的其余部分,并在环中放置一个新的 DMA 缓冲区。

A variant of the asynchronous approach is often seen with network cards. These cards often expect to see a circular buffer (often called a DMA ring buffer) established in memory shared with the processor; each incoming packet is placed in the next available buffer in the ring, and an interrupt is signaled. The driver then passes the network packets to the rest of the kernel and places a new DMA buffer in the ring.
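
网卡的 DMA 环形缓冲区归根结底就是简单的索引运算。下面是一个可以在用户空间运行的草图;它纯属示意,结构与函数名均为虚构。

The ring-buffer bookkeeping described above comes down to simple index arithmetic. Here is a userspace sketch of it; this is purely illustrative, and the structure and function names are invented.

```c
#include <assert.h>

#define NBUF 8   /* number of DMA buffers in the ring */

struct ring_state {
    int hw_next;   /* next buffer the device will fill */
    int sw_next;   /* next buffer the driver will pass up the stack */
    int count;     /* filled buffers awaiting the driver */
};

/* Simulated interrupt-handler work: the device has filled one buffer. */
static void ring_device_filled(struct ring_state *r)
{
    r->hw_next = (r->hw_next + 1) % NBUF;
    r->count++;
}

/* Driver side: hand one filled buffer to the kernel, recycling its slot. */
static int ring_driver_consume(struct ring_state *r)
{
    int idx = r->sw_next;
    r->sw_next = (r->sw_next + 1) % NBUF;
    r->count--;
    return idx;
}
```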

所有这些情况下的处理步骤都强调有效的 DMA 处理依赖于中断报告。虽然可以使用轮询驱动程序来实现 DMA,但这没有意义,因为轮询驱动程序会浪费 DMA 相对于更简单的处理器驱动 I/O 所提供的性能优势。[ 4 ]

The processing steps in all of these cases emphasize that efficient DMA handling relies on interrupt reporting. While it is possible to implement DMA with a polling driver, it wouldn't make sense, because a polling driver would waste the performance benefits that DMA offers over the easier processor-driven I/O.[4]

这里介绍的另一个相关项目是 DMA 缓冲区。DMA 要求设备驱动程序分配一个或多个适合 DMA 的特殊缓冲区。请注意,许多驱动程序在初始化时分配其缓冲区并使用它们直到关闭 - 因此,前面列表中的“分配”一词 意味着“获取先前分配的缓冲区”。

Another relevant item introduced here is the DMA buffer. DMA requires device drivers to allocate one or more special buffers suited to DMA. Note that many drivers allocate their buffers at initialization time and use them until shutdown—the word allocate in the previous lists, therefore, means "get hold of a previously allocated buffer."

分配 DMA 缓冲区

Allocating the DMA Buffer

本节涵盖低级别的 DMA 缓冲区分配;我们很快会介绍一个更高级别的接口,但理解此处提供的材料仍然是一个好主意。

This section covers the allocation of DMA buffers at a low level; we introduce a higher-level interface shortly, but it is still a good idea to understand the material presented here.

DMA 缓冲区出现的主要问题是,当它们大于一页时,它们必须占用物理内存中的连续页,因为设备使用 ISA 或 PCI 系统总线传输数据,这两种总线都携带物理地址。值得注意的是,这个限制不适用于 SBus(参见第 12.5 节),它使用外设总线上的虚拟地址。某些体系结构还可以使用 PCI 总线上的虚拟地址,但可移植驱动程序不能依赖该功能。

The main issue that arises with DMA buffers is that, when they are bigger than one page, they must occupy contiguous pages in physical memory because the device transfers data using the ISA or PCI system bus, both of which carry physical addresses. It's interesting to note that this constraint doesn't apply to the SBus (see Section 12.5), which uses virtual addresses on the peripheral bus. Some architectures can also use virtual addresses on the PCI bus, but a portable driver cannot count on that capability.

虽然 DMA 缓冲区可以在系统启动时或运行时分配,但模块只能在运行时分配其缓冲区。当用于 DMA 操作时,驱动程序编写者必须注意分配正确类型的内存;并非所有内存区域都适合。特别是,高内存可能不适用于某些系统和某些设备上的 DMA,外设根本无法使用那么高的地址。

Although DMA buffers can be allocated either at system boot or at runtime, modules can allocate their buffers only at runtime. Driver writers must take care to allocate the right kind of memory when it is used for DMA operations; not all memory zones are suitable. In particular, high memory may not work for DMA on some systems and with some devices—the peripherals simply cannot work with addresses that high.

现代总线上的大多数设备都可以处理 32 位地址,这意味着正常的内存分配对它们来说效果很好。然而,某些 PCI 设备无法实现完整的 PCI 标准,并且无法使用 32 位地址。当然,ISA 设备仅限于 24 位地址。

Most devices on modern buses can handle 32-bit addresses, meaning that normal memory allocations work just fine for them. Some PCI devices, however, fail to implement the full PCI standard and cannot work with 32-bit addresses. And ISA devices, of course, are limited to 24-bit addresses only.

对于具有此类限制的设备,应通过将 GFP_DMA 标志添加到 kmalloc 或 get_free_pages 调用来从 DMA 区域分配内存。当此标志存在时,仅分配可以用 24 位寻址的内存。或者,您可以使用通用 DMA 层(我们稍后讨论)来分配缓冲区,以绕过设备的限制。

For devices with this kind of limitation, memory should be allocated from the DMA zone by adding the GFP_DMA flag to the kmalloc or get_free_pages call. When this flag is present, only memory that can be addressed with 24 bits is allocated. Alternatively, you can use the generic DMA layer (which we discuss shortly) to allocate buffers that work around your device's limitations.
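
下面这个可在用户空间运行的小例子演示了"24 位可寻址"限制的含义:ISA 风格的设备只能访问 16 MB 以下的总线地址(纯属示意)。

The following small userspace example illustrates what the 24-bit limit means in practice: an ISA-style device can only reach bus addresses below the 16-MB mark (illustrative only).

```c
#include <assert.h>
#include <stdint.h>

/* Can a buffer at bus_addr, len bytes long, be reached with 24-bit
 * addressing? This is the constraint the GFP_DMA zone satisfies. */
static int addr_fits_24_bits(uint64_t bus_addr, uint64_t len)
{
    uint64_t limit = 1ULL << 24;   /* 16 MB: highest 24-bit address + 1 */
    return bus_addr < limit && bus_addr + len <= limit;
}
```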

自己动手分配

Do-it-yourself allocation

我们已经看到 get_free_pages 可以分配最多几兆字节(order 最大可达 MAX_ORDER,当前为 11),但是即使请求的缓冲区远小于 128 KB,高阶请求也很容易失败,因为系统内存会随着时间的推移变得碎片化。[ 5 ]

We have seen how get_free_pages can allocate up to a few megabytes (as order can range up to MAX_ORDER, currently 11), but high-order requests are prone to fail even when the requested buffer is far less than 128 KB, because system memory becomes fragmented over time.[5]

当内核无法返回所请求的内存量,或者您需要超过 128 KB 的内存时(例如 PCI 帧采集器的常见要求),除了返回 -ENOMEM 之外的另一种方法是在启动时分配内存,或为您的缓冲区保留物理 RAM 的顶部。我们在第 8.6 节中描述了启动时的分配,但它不适用于模块。保留 RAM 的顶部是通过在启动时向内核传递一个 mem= 参数来完成的。例如,如果您有 256 MB,则参数 mem=255M 会阻止内核使用顶部的 1 兆字节。您的模块稍后可以使用以下代码来访问此类内存:

When the kernel cannot return the requested amount of memory or when you need more than 128 KB (a common requirement for PCI frame grabbers, for example), an alternative to returning -ENOMEM is to allocate memory at boot time or reserve the top of physical RAM for your buffer. We described allocation at boot time in Section 8.6, but it is not available to modules. Reserving the top of RAM is accomplished by passing a mem= argument to the kernel at boot time. For example, if you have 256 MB, the argument mem=255M keeps the kernel from using the top megabyte. Your module could later use the following code to gain access to such memory:

dmabuf = ioremap (0xFF00000 /* 255M */, 0x100000 /* 1M */);

分配器是本书附带的示例代码的一部分,它提供了一个简单的 API 来探测和管理此类保留的 RAM,并已在多种体系结构上成功使用。但是,当您拥有高内存系统(即物理内存多于 CPU 地址空间所能容纳的内存量的系统)时,此技巧不起作用。

The allocator, part of the sample code accompanying the book, offers a simple API to probe and manage such reserved RAM and has been used successfully on several architectures. However, this trick doesn't work when you have a high-memory system (i.e., one with more physical memory than could fit in the CPU address space).

当然,另一种选择是使用GFP_NOFAIL分配标志来分配缓冲区。然而,这种方法确实会给内存管理子系统带来严重压力,并且存在完全锁定系统的风险;除非确实没有其他办法,否则最好避免这样做。

Another option, of course, is to allocate your buffer with the GFP_NOFAIL allocation flag. This approach does, however, severely stress the memory management subsystem, and it runs the risk of locking up the system altogether; it is best avoided unless there is truly no other way.

然而,如果您要如此大费周章地分配一个大的 DMA 缓冲区,那么值得考虑一下替代方案。如果您的设备可以进行分散/聚集 I/O,您可以将缓冲区分配为更小的部分,然后让设备完成其余的工作。当对用户空间执行直接 I/O 时,也可以使用分散/聚集 I/O,当需要真正巨大的缓冲区时,这很可能是最佳解决方案。

If you are going to such lengths to allocate a large DMA buffer, however, it is worth putting some thought into alternatives. If your device can do scatter/gather I/O, you can allocate your buffer in smaller pieces and let the device do the rest. Scatter/gather I/O can also be used when performing direct I/O into user space, which may well be the best solution when a truly huge buffer is required.

总线地址

Bus Addresses

使用 DMA 的设备驱动程序必须与连接到接口总线的硬件进行通信,该硬件使用物理地址,而程序代码使用虚拟地址。

A device driver using DMA has to talk to hardware connected to the interface bus, which uses physical addresses, whereas program code uses virtual addresses.

事实上,情况比这稍微复杂一些。基于 DMA 的硬件使用总线地址,而不是物理地址。尽管 ISA 和 PCI 总线地址只是 PC 上的物理地址,但并非每个平台都如此。有时,接口总线通过桥接电路连接,该桥接电路将 I/O 地址映射到不同的物理地址。有些系统甚至具有页面映射方案,可以使任意页面看起来与外围总线相邻。

As a matter of fact, the situation is slightly more complicated than that. DMA-based hardware uses bus, rather than physical, addresses. Although ISA and PCI bus addresses are simply physical addresses on the PC, this is not true for every platform. Sometimes the interface bus is connected through bridge circuitry that maps I/O addresses to different physical addresses. Some systems even have a page-mapping scheme that can make arbitrary pages appear contiguous to the peripheral bus.

在最低级别(同样,我们很快就会看到更高级别的解决方案),Linux 内核通过导出<asm/io.h>中定义的以下函数来提供可移植的解决方案 。强烈建议不要使用这些函数,因为它们只能在具有非常简单 I/O 架构的系统上正常工作;尽管如此,在使用内核代码时您可能会遇到它们。

At the lowest level (again, we'll look at a higher-level solution shortly), the Linux kernel provides a portable solution by exporting the following functions, defined in <asm/io.h>. The use of these functions is strongly discouraged, because they work properly only on systems with a very simple I/O architecture; nonetheless, you may encounter them when working with kernel code.

unsigned long virt_to_bus(volatile void *address);
void *bus_to_virt(unsigned long address);

这些函数执行内核逻辑地址和总线地址之间的简单转换。它们在必须对 I/O 内存管理单元进行编程或必须使用反弹缓冲区的任何情况下都不起作用。执行此转换的正确方法是使用通用 DMA 层,因此我们现在继续讨论该主题。

These functions perform a simple conversion between kernel logical addresses and bus addresses. They do not work in any situation where an I/O memory management unit must be programmed or where bounce buffers must be used. The right way of performing this conversion is with the generic DMA layer, so we now move on to that topic.

通用 DMA 层

The Generic DMA Layer

归根结底,DMA 操作就是分配一个缓冲区并将总线地址传递给您的设备。然而,编写在所有架构上都能安全、正确地执行 DMA 的可移植驱动程序,其难度比人们想象的要大。不同的系统对于缓存一致性如何工作有不同的理解;如果您没有正确处理这个问题,您的驱动程序可能会损坏内存。有些系统具有复杂的总线硬件,这可能会使 DMA 任务变得更容易,也可能更困难。并且并非所有系统都可以对内存的所有部分执行 DMA。幸运的是,内核提供了一个独立于总线和体系结构的 DMA 层,它向驱动程序作者隐藏了大部分此类问题。我们强烈鼓励您在编写的任何驱动程序中使用该层进行 DMA 操作。

DMA operations, in the end, come down to allocating a buffer and passing bus addresses to your device. However, the task of writing portable drivers that perform DMA safely and correctly on all architectures is harder than one might think. Different systems have different ideas of how cache coherency should work; if you do not handle this issue correctly, your driver may corrupt memory. Some systems have complicated bus hardware that can make the DMA task easier—or harder. And not all systems can perform DMA out of all parts of memory. Fortunately, the kernel provides a bus- and architecture-independent DMA layer that hides most of these issues from the driver author. We strongly encourage you to use this layer for DMA operations in any driver you write.

下面的许多函数都需要一个指向 struct device 的指针。该结构是 Linux 设备模型中设备的低级表示。驱动程序通常不需要直接使用它,但在使用通用 DMA 层时确实需要它。通常,您可以在描述您的设备的总线特定结构中找到它;例如,它可以作为 struct pci_device 或 struct usb_device 中的 dev 字段找到。第 14 章详细介绍了 device 结构。

Many of the functions below require a pointer to a struct device. This structure is the low-level representation of a device within the Linux device model. It is not something that drivers often have to work with directly, but you do need it when using the generic DMA layer. Usually, you can find this structure buried inside the bus-specific structure that describes your device. For example, it can be found as the dev field in struct pci_device or struct usb_device. The device structure is covered in detail in Chapter 14.

使用以下函数的驱动程序应包含<linux/dma-mapping.h>

Drivers that use the following functions should include <linux/dma-mapping.h>.

处理困难的硬件

Dealing with difficult hardware

在尝试 DMA 之前必须回答的第一个问题是,给定设备是否能够在当前主机上执行此类操作。由于多种原因,许多设备可寻址的内存范围受到限制。默认情况下,内核假定您的设备可以对任何 32 位地址执行 DMA。如果情况并非如此,您应该通过调用以下函数将该事实告知内核:

The first question that must be answered before attempting DMA is whether the given device is capable of such an operation on the current host. Many devices are limited in the range of memory they can address, for a number of reasons. By default, the kernel assumes that your device can perform DMA to any 32-bit address. If this is not the case, you should inform the kernel of that fact with a call to:

    int dma_set_mask(struct device *dev, u64 mask);

mask 应显示您的设备可以寻址的位;例如,如果设备限制为 24 位,则应将 mask 作为 0x0FFFFFF 传递。如果使用给定的 mask 可以进行 DMA,则返回值非零;如果 dma_set_mask 返回 0,则您无法对此设备使用 DMA 操作。因此,仅限 24 位 DMA 操作的设备的驱动程序中,初始化代码可能如下所示:

The mask should show the bits that your device can address; if it is limited to 24 bits, for example, you would pass mask as 0x0FFFFFF. The return value is nonzero if DMA is possible with the given mask; if dma_set_mask returns 0, you are not able to use DMA operations with this device. Thus, the initialization code in a driver for a device limited to 24-bit DMA operations might look like:

if (dma_set_mask (dev, 0xffffff))
    card->use_dma = 1;
else {
    card->use_dma = 0;   /* We'll have to live without DMA */
    printk (KERN_WARNING, "mydev: DMA not supported\n");
}

同样,如果您的设备支持正常的 32 位 DMA 操作,则无需调用dma_set_mask

Again, if your device supports normal, 32-bit DMA operations, there is no need to call dma_set_mask.

DMA 映射

DMA mappings

DMA 映射是分配 DMA 缓冲区并为该缓冲区生成设备可访问地址的组合。通过简单调用 virt_to_bus 来获取该地址很诱人,但有充分的理由避免这种方法。第一个原因是,合理的硬件都配备了 IOMMU,为总线提供一组映射寄存器。IOMMU 可以安排任何物理内存出现在设备可访问的地址范围内,并且它可以使物理上分散的缓冲区对设备而言看起来是连续的。使用 IOMMU 需要使用通用 DMA 层;virt_to_bus 无法胜任这项任务。

A DMA mapping is a combination of allocating a DMA buffer and generating an address for that buffer that is accessible by the device. It is tempting to get that address with a simple call to virt_to_bus, but there are strong reasons for avoiding that approach. The first of those is that reasonable hardware comes with an IOMMU that provides a set of mapping registers for the bus. The IOMMU can arrange for any physical memory to appear within the address range accessible by the device, and it can cause physically scattered buffers to look contiguous to the device. Making use of the IOMMU requires using the generic DMA layer; virt_to_bus is not up to the task.

请注意,并非所有架构都有 IOMMU;特别是,流行的 x86 平台没有 IOMMU 支持。然而,正确编写的驱动程序不需要知道它正在运行的 I/O 支持硬件。

Note that not all architectures have an IOMMU; in particular, the popular x86 platform has no IOMMU support. A properly written driver need not be aware of the I/O support hardware it is running over, however.

在某些情况下,为设备设置有用的地址可能还需要建立一个反弹缓冲区。当驱动程序尝试对外围设备无法访问的地址(例如高内存地址)执行 DMA 时,就会创建反弹缓冲区。然后根据需要将数据复制到反弹缓冲区或从反弹缓冲区复制出来。不用说,使用反弹缓冲区会减慢速度,但有时别无选择。

Setting up a useful address for the device may also, in some cases, require the establishment of a bounce buffer. Bounce buffers are created when a driver attempts to perform DMA on an address that is not reachable by the peripheral device—a high-memory address, for example. Data is then copied to and from the bounce buffer as needed. Needless to say, use of bounce buffers can slow things down, but sometimes there is no alternative.

DMA 映射还必须解决缓存一致性问题。请记住,现代处理器将最近访问的内存区域的副本保存在快速的本地缓存中;没有这个缓存,就不可能有合理的性能。如果您的设备更改了主内存的某个区域,则覆盖该区域的任何处理器缓存都必须失效;否则,处理器可能会使用不正确的主内存映像,从而导致数据损坏。同样,当您的设备使用 DMA 从主内存读取数据时,必须首先刷新驻留在处理器缓存中的对该内存的任何更改。如果程序员不小心,这些缓存一致性问题可能会产生无数晦涩且难以发现的错误。一些架构在硬件中管理缓存一致性,但其他架构则需要软件支持。通用 DMA 层竭尽全力确保在所有架构上都能正常工作,但是,正如我们将看到的,正确的行为需要遵守一小组规则。

DMA mappings must also address the issue of cache coherency. Remember that modern processors keep copies of recently accessed memory areas in a fast, local cache; without this cache, reasonable performance is not possible. If your device changes an area of main memory, it is imperative that any processor caches covering that area be invalidated; otherwise the processor may work with an incorrect image of main memory, and data corruption results. Similarly, when your device uses DMA to read data from main memory, any changes to that memory residing in processor caches must be flushed out first. These cache coherency issues can create no end of obscure and difficult-to-find bugs if the programmer is not careful. Some architectures manage cache coherency in the hardware, but others require software support. The generic DMA layer goes to great lengths to ensure that things work correctly on all architectures, but, as we will see, proper behavior requires adherence to a small set of rules.

DMA 映射设置了一个新类型 dma_addr_t 来表示总线地址。驱动程序应将 dma_addr_t 类型的变量视为不透明的;唯一允许的操作是将它们传递给 DMA 支持例程和设备本身。作为总线地址,dma_addr_t 如果直接被 CPU 使用,可能会导致意想不到的问题。

The DMA mapping sets up a new type, dma_addr_t, to represent bus addresses. Variables of type dma_addr_t should be treated as opaque by the driver; the only allowable operations are to pass them to the DMA support routines and to the device itself. As a bus address, dma_addr_t may lead to unexpected problems if used directly by the CPU.

PCI 代码区分两种类型的 DMA 映射,具体取决于 DMA 缓冲区预计保留的时间:

The PCI code distinguishes between two types of DMA mappings, depending on how long the DMA buffer is expected to stay around:

相干 DMA 映射
Coherent DMA mappings

这些映射通常在驱动程序的生命周期内一直存在。相干缓冲区必须同时可供 CPU 和外设使用(正如我们稍后将看到的,其他类型的映射在任何给定时间只能供其中之一使用)。因此,相干映射必须存在于高速缓存一致的内存中。相干映射的设置和使用成本可能很高。

These mappings usually exist for the life of the driver. A coherent buffer must be simultaneously available to both the CPU and the peripheral (other types of mappings, as we will see later, can be available only to one or the other at any given time). As a result, coherent mappings must live in cache-coherent memory. Coherent mappings can be expensive to set up and use.

流 DMA 映射
Streaming DMA mappings

流映射通常是为单个操作设置的。正如我们将看到的,某些架构允许在使用流映射时进行重大优化,但这些映射在访问方式方面也受到一组更严格的规则的约束。内核开发人员建议尽可能使用流映射而不是相干映射。做出此建议有两个原因。第一个原因是,在支持映射寄存器的系统上,每个 DMA 映射都使用总线上的一个或多个寄存器。相干映射具有很长的生命周期,即使不被使用,也可能长时间独占这些寄存器。另一个原因是,在某些硬件上,流映射可以以相干映射无法实现的方式进行优化。

Streaming mappings are usually set up for a single operation. Some architectures allow for significant optimizations when streaming mappings are used, as we see, but these mappings also are subject to a stricter set of rules in how they may be accessed. The kernel developers recommend the use of streaming mappings over coherent mappings whenever possible. There are two reasons for this recommendation. The first is that, on systems that support mapping registers, each DMA mapping uses one or more of them on the bus. Coherent mappings, which have a long lifetime, can monopolize these registers for a long time, even when they are not being used. The other reason is that, on some hardware, streaming mappings can be optimized in ways that are not available to coherent mappings.

这两种映射类型必须以不同的方式进行操作;是时候看看细节了。

The two mapping types must be manipulated in different ways; it's time to look at the details.

设置相干 DMA 映射

Setting up coherent DMA mappings

驱动程序可以通过调用 dma_alloc_coherent 来设置相干映射:

A driver can set up a coherent mapping with a call to dma_alloc_coherent:

void *dma_alloc_coherent(struct device *dev, size_t size,
                         dma_addr_t *dma_handle, int flag);

该函数处理缓冲区的分配和映射。前两个参数是设备结构和所需缓冲区的大小。该函数在两个地方返回 DMA 映射的结果。函数的返回值是缓冲区的内核虚拟地址,可供驱动程序使用;同时,相关的总线地址在 dma_handle 中返回。在此函数中处理分配,以便将缓冲区放置在适合 DMA 的位置;通常内存只是用 get_free_pages 分配的(但请注意,大小以字节为单位,而不是 order 值)。flag 参数是通常的 GFP_ 值,描述如何分配内存;它通常应该是 GFP_KERNEL(一般情况)或 GFP_ATOMIC(在原子上下文中运行时)。

This function handles both the allocation and the mapping of the buffer. The first two arguments are the device structure and the size of the buffer needed. The function returns the result of the DMA mapping in two places. The return value from the function is a kernel virtual address for the buffer, which may be used by the driver; the associated bus address, meanwhile, is returned in dma_handle. Allocation is handled in this function so that the buffer is placed in a location that works with DMA; usually the memory is just allocated with get_free_pages (but note that the size is in bytes, rather than an order value). The flag argument is the usual GFP_ value describing how the memory is to be allocated; it should usually be GFP_KERNEL or GFP_ATOMIC (when running in atomic context).

当不再需要缓冲区时(通常在模块卸载时),应使用dma_free_coherent将其返回给系统:

When the buffer is no longer needed (usually at module unload time), it should be returned to the system with dma_free_coherent:

void dma_free_coherent(struct device *dev, size_t size,
                        void *vaddr, dma_addr_t dma_handle);

请注意,此函数与许多通用 DMA 函数一样,要求提供所有大小、CPU 地址和总线地址参数。

Note that this function, like many of the generic DMA functions, requires that all of the size, CPU address, and bus address arguments be provided.
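
下面是一个假设性的草图,展示这两个调用在驱动程序中如何配对使用;mydev 名称与字段均为虚构,错误处理从简。

A hypothetical sketch of how the two calls pair up in a driver; the mydev names and fields are invented, and error handling is abbreviated.

```c
struct mydev_dma {
    struct device *dev;
    void          *vaddr;      /* CPU-side (kernel virtual) address */
    dma_addr_t     bus_addr;   /* device-side (bus) address */
    size_t         size;
};

static int mydev_setup_buffer(struct mydev_dma *md)
{
    md->size = 4096;
    md->vaddr = dma_alloc_coherent(md->dev, md->size,
                                   &md->bus_addr, GFP_KERNEL);
    if (!md->vaddr)
        return -ENOMEM;
    /* program md->bus_addr into the device's DMA address register here */
    return 0;
}

/* Usually at module unload: size, CPU address, and bus address all go back. */
static void mydev_release_buffer(struct mydev_dma *md)
{
    dma_free_coherent(md->dev, md->size, md->vaddr, md->bus_addr);
}
```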

DMA 池

DMA pools

DMA 池是一种针对小型相干 DMA 映射的分配机制。从 dma_alloc_coherent 获得的映射的最小大小可能为一页。如果您的设备需要比这更小的 DMA 区域,您可能应该使用 DMA 池。当您可能想对嵌入在较大结构中的小区域执行 DMA 时,DMA 池也很有用。一些非常隐蔽的驱动程序错误已被追溯到与小型 DMA 区域相邻的结构字段的缓存一致性问题。为了避免这个问题,您应该始终为 DMA 操作显式分配区域,使其远离其他非 DMA 数据结构。

A DMA pool is an allocation mechanism for small, coherent DMA mappings. Mappings obtained from dma_alloc_coherent may have a minimum size of one page. If your device needs smaller DMA areas than that, you should probably be using a DMA pool. DMA pools are also useful in situations where you may be tempted to perform DMA to small areas embedded within a larger structure. Some very obscure driver bugs have been traced down to cache coherency problems with structure fields adjacent to small DMA areas. To avoid this problem, you should always allocate areas for DMA operations explicitly, away from other, non-DMA data structures.

DMA 池函数在<linux/dmapool.h>中定义。

The DMA pool functions are defined in <linux/dmapool.h>.

在使用之前必须通过调用创建 DMA 池:

A DMA pool must be created before use with a call to:

struct dma_pool *dma_pool_create(const char *name, struct device *dev, 
                                 size_t size, size_t align, 
                                 size_t allocation);

这里,name 是池的名称,dev 是您的设备结构,size 是要从此池分配的缓冲区的大小,align 是从池分配时所需的硬件对齐(以字节表示);如果 allocation 不为零,则它是分配不应跨越的内存边界。例如,如果 allocation 作为 4096 传递,则从此池分配的缓冲区不会跨越 4 KB 边界。

Here, name is a name for the pool, dev is your device structure, size is the size of the buffers to be allocated from this pool, align is the required hardware alignment for allocations from the pool (expressed in bytes), and allocation is, if nonzero, a memory boundary that allocations should not exceed. If allocation is passed as 4096, for example, the buffers allocated from this pool do not cross 4-KB boundaries.
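
allocation 参数所表达的"不跨越边界"约束可以用几行用户空间代码来检验(纯属示意;boundary 必须是 2 的幂,例如 4096)。

The "does not cross a boundary" constraint that the allocation parameter expresses can be checked with a few lines of userspace arithmetic (illustrative only; boundary must be a power of two such as 4096).

```c
#include <assert.h>
#include <stdint.h>

/* Does a size-byte buffer starting at addr cross a boundary-byte
 * boundary? A pool created with allocation == boundary guarantees
 * this is false for every buffer it hands out. */
static int crosses_boundary(uint64_t addr, uint64_t size, uint64_t boundary)
{
    return (addr / boundary) != ((addr + size - 1) / boundary);
}
```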

当您用完 DMA 池后,可以将其释放:

When you are done with a pool, it can be freed with:

void dma_pool_destroy(struct dma_pool *pool);

在销毁池之前,您应该将所有分配返回到池中。

You should return all allocations to the pool before destroying it.

分配由dma_pool_alloc处理:

Allocations are handled with dma_pool_alloc:

void *dma_pool_alloc(struct dma_pool *pool, int mem_flags, 
                     dma_addr_t *handle);

对于此调用,mem_flags 是通常的 GFP_ 分配标志集。如果一切顺利,则会分配并返回一个内存区域(其大小在创建池时指定)。与 dma_alloc_coherent 一样,生成的 DMA 缓冲区的地址作为内核虚拟地址返回,并作为总线地址存储在 handle 中。

For this call, mem_flags is the usual set of GFP_ allocation flags. If all goes well, a region of memory (of the size specified when the pool was created) is allocated and returned. As with dma_alloc_coherent, the address of the resulting DMA buffer is returned as a kernel virtual address and stored in handle as a bus address.

不需要的缓冲区应通过以下方式返回到池中:

Unneeded buffers should be returned to the pool with:

void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t addr);

设置流 DMA 映射

Setting up streaming DMA mappings

出于多种原因,流映射的接口比相干映射更复杂。这些映射期望使用驱动程序已分配的缓冲区,因此必须处理它们没有选择权的地址。在某些体系结构上,流映射还可以具有多个不连续的页面和多部分的"分散/聚集"缓冲区。由于所有这些原因,流映射有自己的一组映射函数。

Streaming mappings have a more complicated interface than the coherent variety, for a number of reasons. These mappings expect to work with a buffer that has already been allocated by the driver and, therefore, have to deal with addresses that they did not choose. On some architectures, streaming mappings can also have multiple, discontiguous pages and multipart "scatter/gather" buffers. For all of these reasons, streaming mappings have their own set of mapping functions.

设置流映射时,必须告诉内核数据正在朝哪个方向移动。为此定义了一些符号(类型为 enum dma_data_direction):

When setting up a streaming mapping, you must tell the kernel in which direction the data is moving. Some symbols (of type enum dma_data_direction) have been defined for this purpose:

DMA_TO_DEVICE

DMA_FROM_DEVICE

这两个符号应该是不言自明的。如果数据正在发送到设备(也许是响应 write 系统调用),则应使用 DMA_TO_DEVICE;相反,进入 CPU 的数据标记为 DMA_FROM_DEVICE。

These two symbols should be reasonably self-explanatory. If data is being sent to the device (in response, perhaps, to a write system call), DMA_TO_DEVICE should be used; data going to the CPU, instead, is marked with DMA_FROM_DEVICE.

DMA_BIDIRECTIONAL

如果数据可以向任一方向移动,请使用DMA_BIDIRECTIONAL

If data can move in either direction, use DMA_BIDIRECTIONAL.

DMA_NONE

该符号仅作为调试辅助而提供。尝试按此“方向”使用缓冲区会导致内核恐慌。

This symbol is provided only as a debugging aid. Attempts to use buffers with this "direction" cause a kernel panic.

随时简单地选择 DMA_BIDIRECTIONAL 可能很诱人,但驱动程序作者应该抵制这种诱惑。在某些架构上,这种选择会带来性能损失。

It may be tempting to just pick DMA_BIDIRECTIONAL at all times, but driver authors should resist that temptation. On some architectures, there is a performance penalty to pay for that choice.

当您有单个要传输的缓冲区时,使用 dma_map_single 映射它:

When you have a single buffer to transfer, map it with dma_map_single:

dma_addr_t dma_map_single(struct device *dev, void *buffer, size_t size, 
                          enum dma_data_direction direction);

返回值是您可以传递给设备的总线地址;如果出现问题,则为 NULL。

The return value is the bus address that you can pass to the device or NULL if something goes wrong.

一旦传输完成,应使用 dma_unmap_single 删除映射:

Once the transfer is complete, the mapping should be deleted with dma_unmap_single:

void dma_unmap_single(struct device *dev, dma_addr_t dma_addr, size_t size, 
                      enum dma_data_direction direction);

这里,sizedirection参数必须与用于映射缓冲区的参数匹配。

Here, the size and direction arguments must match those used to map the buffer.
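
一个典型的流映射使用模式如下。这只是一个假设性的草图:mydev_* 辅助函数为虚构,真正的驱动程序还需按其平台的要求检查映射错误。

A typical use of a streaming mapping looks like the following hypothetical sketch; the mydev_* helpers are invented, and a real driver would also check for mapping errors as its platform requires.

```c
/* Send len bytes from buffer to the device with a streaming mapping. */
static int mydev_send(struct device *dev, void *buffer, size_t len)
{
    dma_addr_t bus_addr;

    /* The buffer must already contain all the data to write. */
    bus_addr = dma_map_single(dev, buffer, len, DMA_TO_DEVICE);
    if (!bus_addr)
        return -ENOMEM;

    mydev_start_tx(bus_addr, len);   /* hand the bus address to the device */
    mydev_wait_tx_complete();        /* hands off the buffer until DMA ends */

    /* size and direction must match those given to dma_map_single. */
    dma_unmap_single(dev, bus_addr, len, DMA_TO_DEVICE);
    return 0;
}
```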

一些重要规则适用于流 DMA 映射:

Some important rules apply to streaming DMA mappings:

  • 缓冲区必须仅用于与映射时给定的方向值匹配的传输。

  • The buffer must be used only for a transfer that matches the direction value given when it was mapped.

  • 一旦缓冲区被映射,它就属于设备,而不是处理器。在缓冲区被取消映射之前,驱动程序不应以任何方式触及其内容。只有在调用dma_unmap_single后,驱动程序才能安全地访问缓冲区的内容(除了我们很快看到的一个例外)。除此之外,该规则意味着正在写入设备的缓冲区在包含所有要写入的数据之前无法映射。

  • Once a buffer has been mapped, it belongs to the device, not the processor. Until the buffer has been unmapped, the driver should not touch its contents in any way. Only after dma_unmap_single has been called is it safe for the driver to access the contents of the buffer (with one exception that we see shortly). Among other things, this rule implies that a buffer being written to a device cannot be mapped until it contains all the data to write.

  • 当 DMA 仍处于活动状态时,不得取消映射缓冲区,否则会导致系统严重不稳定。

  • The buffer must not be unmapped while DMA is still active, or serious system instability is guaranteed.

您可能想知道,为什么缓冲区一旦被映射,驱动程序就不能再使用它。实际上,这条规则有意义的原因有两个。首先,当缓冲区为 DMA 而被映射时,内核必须确保该缓冲区中的所有数据都已实际写入内存。调用 dma_map_single 时,某些数据很可能还在处理器的缓存中,必须被显式刷新。刷新之后处理器写入缓冲区的数据,设备可能看不到。

You may be wondering why the driver can no longer work with a buffer once it has been mapped. There are actually two reasons why this rule makes sense. First, when a buffer is mapped for DMA, the kernel must ensure that all of the data in that buffer has actually been written to memory. It is likely that some data is in the processor's cache when dma_map_single is issued, and must be explicitly flushed. Data written to the buffer by the processor after the flush may not be visible to the device.

其次,考虑如果要映射的缓冲区位于设备无法访问的内存区域中会发生什么。有些架构在这种情况下会直接失败,而另一些架构会创建反弹缓冲区。反弹缓冲区只是设备可以访问的一个单独内存区域。如果缓冲区以 DMA_TO_DEVICE 方向映射并且需要反弹缓冲区,则原始缓冲区的内容会作为映射操作的一部分被复制。显然,复制之后对原始缓冲区的更改,设备是看不到的。类似地,DMA_FROM_DEVICE 的反弹缓冲区由 dma_unmap_single 复制回原始缓冲区;在这次复制完成之前,来自设备的数据不会出现在原始缓冲区中。

Second, consider what happens if the buffer to be mapped is in a region of memory that is not accessible to the device. Some architectures simply fail in this case, but others create a bounce buffer. The bounce buffer is just a separate region of memory that is accessible to the device. If a buffer is mapped with a direction of DMA_TO_DEVICE, and a bounce buffer is required, the contents of the original buffer are copied as part of the mapping operation. Clearly, changes to the original buffer after the copy are not seen by the device. Similarly, DMA_FROM_DEVICE bounce buffers are copied back to the original buffer by dma_unmap_single; the data from the device is not present until that copy has been done.

顺便说一句,反弹缓冲区是把传输方向设置正确之所以重要的原因之一:DMA_BIDIRECTIONAL 的反弹缓冲区在操作之前和之后都会被复制,这通常是对 CPU 周期不必要的浪费。

Incidentally, bounce buffers are one reason why it is important to get the direction right. DMA_BIDIRECTIONAL bounce buffers are copied both before and after the operation, which is often an unnecessary waste of CPU cycles.

有时,驱动程序需要在不取消映射的情况下访问流式 DMA 缓冲区的内容。为此,内核提供了一个调用:

Occasionally a driver needs to access the contents of a streaming DMA buffer without unmapping it. A call has been provided to make this possible:

void dma_sync_single_for_cpu(struct device *dev, dma_addr_t bus_addr, 
                             size_t size, enum dma_data_direction direction);

应在处理器访问流式 DMA 缓冲区之前调用此函数。一旦发出调用,CPU“拥有”DMA 缓冲区并可以根据需要使用它。然而,在设备访问缓冲区之前,应使用以下命令将所有权转移回设备:

This function should be called before the processor accesses a streaming DMA buffer. Once the call has been made, the CPU "owns" the DMA buffer and can work with it as needed. Before the device accesses the buffer, however, ownership should be transferred back to it with:

void dma_sync_single_for_device(struct device *dev, dma_addr_t bus_addr, 
                                size_t size, enum dma_data_direction direction);

执行此调用后,处理器不应再次访问 DMA 缓冲区。

The processor, once again, should not access the DMA buffer after this call has been made.

单页流映射

Single-page streaming mappings

有时,您可能想为一个只有 struct page 指针的缓冲区建立映射;例如,使用 get_user_pages 映射的用户空间缓冲区就可能出现这种情况。要使用 struct page 指针建立和拆除流式映射,请使用以下函数:

Occasionally, you may want to set up a mapping on a buffer for which you have a struct page pointer; this can happen, for example, with user-space buffers mapped with get_user_pages. To set up and tear down streaming mappings using struct page pointers, use the following:

dma_addr_t dma_map_page(struct device *dev, struct page *page,
                        unsigned long offset, size_t size,
                        enum dma_data_direction direction);

void dma_unmap_page(struct device *dev, dma_addr_t dma_address, 
                    size_t size, enum dma_data_direction direction);

offset 和 size 参数可用于映射页面的一部分。但是,除非您确实清楚自己在做什么,否则建议避免部分页面映射。如果映射只覆盖某个缓存行的一部分,就可能导致缓存一致性问题;这进而可能导致内存损坏和极难调试的错误。

The offset and size arguments can be used to map part of a page. It is recommended, however, that partial-page mappings be avoided unless you are really sure of what you are doing. Mapping part of a page can lead to cache coherency problems if the allocation covers only part of a cache line; that, in turn, can lead to memory corruption and extremely difficult-to-debug bugs.

分散/聚集映射

Scatter/gather mappings

分散/聚集映射是一种特殊类型的流 DMA 映射。假设您有多个缓冲区,所有这些缓冲区都需要与设备进行传输。这种情况可以通过多种方式发生,包括readvwritev系统调用、集群磁盘 I/O 请求或映射内核 I/O 缓冲区中的页面列表。您可以简单地依次映射每个缓冲区,并执行所需的操作,但一次映射整个列表有一些优点。

Scatter/gather mappings are a special type of streaming DMA mapping. Suppose you have several buffers, all of which need to be transferred to or from the device. This situation can come about in several ways, including from a readv or writev system call, a clustered disk I/O request, or a list of pages in a mapped kernel I/O buffer. You could simply map each buffer, in turn, and perform the required operation, but there are advantages to mapping the whole list at once.

许多设备可以接受数组指针和长度的分散列表,并在一次 DMA 操作中传输它们;例如,如果可以将数据包构建为多个部分,那么“零复制”网络就会更容易。将分散列表作为一个整体进行映射的另一个原因是利用在总线硬件中具有映射寄存器的系统。在此类系统上,从设备的角度来看,物理上不连续的页面可以组装成单个连续的数组。仅当分散列表中的条目长度等于页面大小(第一个和最后一个除外)时,此技术才起作用,但当它起作用时,它可以将多个操作转换为单个 DMA,并相应地加快速度。

Many devices can accept a scatterlist of array pointers and lengths, and transfer them all in one DMA operation; for example, "zero-copy" networking is easier if packets can be built in multiple pieces. Another reason to map scatterlists as a whole is to take advantage of systems that have mapping registers in the bus hardware. On such systems, physically discontiguous pages can be assembled into a single, contiguous array from the device's point of view. This technique works only when the entries in the scatterlist are equal to the page size in length (except the first and last), but when it does work, it can turn multiple operations into a single DMA, and speed things up accordingly.

最后,如果必须使用反弹缓冲区,则将整个列表合并到单个缓冲区中是有意义的(因为无论如何它都会被复制)。

Finally, if a bounce buffer must be used, it makes sense to coalesce the entire list into a single buffer (since it is being copied anyway).

现在您已经确信,在某些情况下映射分散列表是值得的。映射分散列表的第一步,是创建并填充一个描述要传输缓冲区的 struct scatterlist 数组。该结构与体系结构相关,在 <asm/scatterlist.h> 中描述。但是,它始终包含三个字段:

So now you're convinced that mapping of scatterlists is worthwhile in some situations. The first step in mapping a scatterlist is to create and fill in an array of struct scatterlist describing the buffers to be transferred. This structure is architecture dependent, and is described in <asm/scatterlist.h>. However, it always contains three fields:

struct page *page;

与分散/聚集操作中使用的缓冲区相对应的 struct page 指针。

The struct page pointer corresponding to the buffer to be used in the scatter/gather operation.

unsigned int length;

unsigned int offset;

该缓冲区的长度及其在页面内的偏移量

The length of that buffer and its offset within the page

要映射分散/聚集 DMA 操作,驱动程序应为每个要传输的缓冲区,在对应的 struct scatterlist 条目中设置 page、offset 和 length 字段,然后调用:

To map a scatter/gather DMA operation, your driver should set the page, offset, and length fields in a struct scatterlist entry for each buffer to be transferred. Then call:

int dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
               enum dma_data_direction direction)

其中 nents 是传入的分散列表条目数。返回值是要传输的 DMA 缓冲区数量;它可能小于 nents。

where nents is the number of scatterlist entries passed in. The return value is the number of DMA buffers to transfer; it may be less than nents.

对于输入分散列表中的每个缓冲区,dma_map_sg 都会确定要提供给设备的正确总线地址。作为该任务的一部分,它还会合并内存中彼此相邻的缓冲区。如果驱动程序运行的系统具有 I/O 内存管理单元,dma_map_sg 还会对该单元的映射寄存器进行编程,其可能的结果是:从设备的角度来看,您能够传输单个连续的缓冲区。然而,在调用返回之前,您无法知道最终的传输会是什么样子。

For each buffer in the input scatterlist, dma_map_sg determines the proper bus address to give to the device. As part of that task, it also coalesces buffers that are adjacent to each other in memory. If the system your driver is running on has an I/O memory management unit, dma_map_sg also programs that unit's mapping registers, with the possible result that, from your device's point of view, you are able to transfer a single, contiguous buffer. You will never know what the resulting transfer will look like, however, until after the call.

您的驱动程序应该传输 dma_map_sg 返回的每个缓冲区。每个缓冲区的总线地址和长度都存储在 struct scatterlist 条目中,但它们在结构中的位置因体系结构而异。为了可以编写可移植代码,内核定义了两个宏:

Your driver should transfer each buffer returned by dma_map_sg. The bus address and length of each buffer are stored in the struct scatterlist entries, but their location in the structure varies from one architecture to the next. Two macros have been defined to make it possible to write portable code:

dma_addr_t sg_dma_address(struct scatterlist *sg);

返回此分散列表条目的总线 (DMA) 地址。

Returns the bus (DMA) address from this scatterlist entry.

unsigned int sg_dma_len(struct scatterlist *sg);

返回此缓冲区的长度。

Returns the length of this buffer.

再次记住,要传输的缓冲区的地址和长度可能与传递给dma_map_sg的地址和长度不同。

Again, remember that the address and length of the buffers to transfer may be different from what was passed in to dma_map_sg.

传输完成后,将通过调用 dma_unmap_sg取消分散/聚集映射的映射:

Once the transfer is complete, a scatter/gather mapping is unmapped with a call to dma_unmap_sg:

void dma_unmap_sg(struct device *dev, struct scatterlist *list,
                  int nents, enum dma_data_direction direction);

请注意,nents 必须是您最初传递给 dma_map_sg 的条目数,而不是该函数返回给您的 DMA 缓冲区数。

Note that nents must be the number of entries that you originally passed to dma_map_sg and not the number of DMA buffers the function returned to you.

分散/聚集映射属于流式 DMA 映射,适用于单缓冲区映射的访问规则同样适用于它们。如果必须访问已映射的分散/聚集列表,必须先对其进行同步:

Scatter/gather mappings are streaming DMA mappings, and the same access rules apply to them as to the single variety. If you must access a mapped scatter/gather list, you must synchronize it first:

void dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
                         int nents, enum dma_data_direction direction);
void dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
                         int nents, enum dma_data_direction direction);

PCI双地址周期映射

PCI double-address cycle mappings

一般情况下,DMA 支持层使用 32 位总线地址,并可能受特定设备 DMA 掩码的限制。然而,PCI 总线还支持一种 64 位寻址模式,即双地址周期(DAC)。通用 DMA 层不支持这种模式,原因有几个,首先它是 PCI 特有的功能。此外,许多 DAC 实现充其量也是有缺陷的,而且由于 DAC 比常规 32 位 DMA 慢,可能会付出性能代价。即便如此,在某些应用中使用 DAC 可能是正确的选择;如果您的设备可能需要使用位于高端内存中的非常大的缓冲区,则可以考虑实现 DAC 支持。此支持仅适用于 PCI 总线,因此必须使用 PCI 特有的例程。

Normally, the DMA support layer works with 32-bit bus addresses, possibly restricted by a specific device's DMA mask. The PCI bus, however, also supports a 64-bit addressing mode, the double-address cycle (DAC). The generic DMA layer does not support this mode for a couple of reasons, the first of which being that it is a PCI-specific feature. Also, many implementations of DAC are buggy at best, and, because DAC is slower than a regular, 32-bit DMA, there can be a performance cost. Even so, there are applications where using DAC can be the right thing to do; if you have a device that is likely to be working with very large buffers placed in high memory, you may want to consider implementing DAC support. This support is available only for the PCI bus, so PCI-specific routines must be used.

要使用 DAC,您的驱动程序必须包含<linux/pci.h>。您必须设置单独的 DMA 掩码:

To use DAC, your driver must include <linux/pci.h>. You must set a separate DMA mask:

int pci_dac_set_dma_mask(struct pci_dev *pdev, u64 mask);

仅当此调用返回 0 时,您才能使用 DAC 寻址。

You can use DAC addressing only if this call returns 0.

特殊类型 ( dma64_addr_t) 用于 DAC 映射。要建立这些映射之一,请调用 pci_dac_page_to_dma

A special type (dma64_addr_t) is used for DAC mappings. To establish one of these mappings, call pci_dac_page_to_dma:

dma64_addr_t pci_dac_page_to_dma(struct pci_dev *pdev, struct page *page, 
                                 unsigned long offset, int direction);

您会注意到,DAC 映射只能从 struct page 指针建立(毕竟,这些缓冲区应该位于高端内存中,否则使用 DAC 就没有意义);而且必须一次创建一个页面。direction 参数是通用 DMA 层中 enum dma_data_direction 的 PCI 等价物;它应该是 PCI_DMA_TODEVICE、PCI_DMA_FROMDEVICE 或 PCI_DMA_BIDIRECTIONAL。

DAC mappings, you will notice, can be made only from struct page pointers (they should live in high memory, after all, or there is no point in using them); they must be created a single page at a time. The direction argument is the PCI equivalent of the enum dma_data_direction used in the generic DMA layer; it should be PCI_DMA_TODEVICE, PCI_DMA_FROMDEVICE, or PCI_DMA_BIDIRECTIONAL.

DAC 映射不需要外部资源,因此使用后无需显式释放它们。然而,有必要像对待其他流式映射一样对待 DAC 映射,并遵守有关缓冲区所有权的规则。有一组与通用版本类似的函数,用于同步 DMA 缓冲区:

DAC mappings require no external resources, so there is no need to explicitly release them after use. It is necessary, however, to treat DAC mappings like other streaming mappings, and observe the rules regarding buffer ownership. There is a set of functions for synchronizing DMA buffers that is analogous to the generic variety:

void pci_dac_dma_sync_single_for_cpu(struct pci_dev *pdev,
                                     dma64_addr_t dma_addr,
                                     size_t len,
                                     int direction);

void pci_dac_dma_sync_single_for_device(struct pci_dev *pdev,
                                        dma64_addr_t dma_addr,
                                        size_t len,
                                        int direction);

一个简单的 PCI DMA 示例

A simple PCI DMA example

作为如何使用 DMA 映射的示例,我们提供了 PCI 设备的 DMA 编码的简单示例。PCI 总线上 DMA 操作的实际形式很大程度上取决于所驱动的设备。因此,该示例不适用于任何实际设备;相反,它是名为“dad” (DMA 采集设备)的假设驱动程序的一部分。该设备的驱动程序可能定义如下传输函数:

As an example of how the DMA mappings might be used, we present a simple example of DMA coding for a PCI device. The actual form of DMA operations on the PCI bus is very dependent on the device being driven. Thus, this example does not apply to any real device; instead, it is part of a hypothetical driver called dad (DMA Acquisition Device). A driver for this device might define a transfer function like this:

int dad_transfer(struct dad_dev *dev, int write, void *buffer, 
                 size_t count)
{
    dma_addr_t bus_addr;

    /* Map the buffer for DMA */
    dev->dma_dir = (write ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
    dev->dma_size = count;
    bus_addr = dma_map_single(&dev->pci_dev->dev, buffer, count, 
                              dev->dma_dir);
    dev->dma_addr = bus_addr;

    /* Set up the device */

    writeb(dev->registers.command, DAD_CMD_DISABLEDMA);
    writeb(dev->registers.command, write ? DAD_CMD_WR : DAD_CMD_RD);
    writel(dev->registers.addr, cpu_to_le32(bus_addr));
    writel(dev->registers.len, cpu_to_le32(count));

    /* Start the operation */
    writeb(dev->registers.command, DAD_CMD_ENABLEDMA);
    return 0;
}

该函数映射要传输的缓冲区并启动设备操作。另一半工作必须在中断服务例程中完成,如下所示:

This function maps the buffer to be transferred and starts the device operation. The other half of the job must be done in the interrupt service routine, which looks something like this:

void dad_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    struct dad_dev *dev = (struct dad_dev *) dev_id;

    /* Make sure it's really our device interrupting */

    /* Unmap the DMA buffer */
    dma_unmap_single(&dev->pci_dev->dev, dev->dma_addr, 
                     dev->dma_size, dev->dma_dir);

    /* Only now is it safe to access the buffer, copy to user, etc. */
    ...
}

显然,该示例省略了大量细节,包括防止尝试启动多个同时 DMA 操作可能需要的任何步骤。

Obviously, a great deal of detail has been left out of this example, including whatever steps may be required to prevent attempts to start multiple, simultaneous DMA operations.

ISA 设备的 DMA

DMA for ISA Devices

ISA 总线允许两种 DMA 传输:本机 DMA 和 ISA 总线主控 DMA。Native DMA 使用主板上的标准 DMA 控制器电路来驱动 ISA 总线上的信号线。另一方面,ISA 总线主控 DMA 完全由外围设备处理。后一种类型的 DMA 很少使用,因此不需要在这里讨论,因为它与 PCI 设备的 DMA 类似,至少从驱动程序的角度来看是这样。ISA 总线主控的一个例子是 1542 SCSI 控制器,其驱动程序是内核源代码中的drivers/scsi/aha1542.c 。

The ISA bus allows for two kinds of DMA transfers: native DMA and ISA bus master DMA. Native DMA uses standard DMA-controller circuitry on the motherboard to drive the signal lines on the ISA bus. ISA bus master DMA, on the other hand, is handled entirely by the peripheral device. The latter type of DMA is rarely used and doesn't require discussion here, because it is similar to DMA for PCI devices, at least from the driver's point of view. An example of an ISA bus master is the 1542 SCSI controller, whose driver is drivers/scsi/aha1542.c in the kernel sources.

就本机 DMA 而言,ISA 总线上的 DMA 数据传输涉及三个实体:

As far as native DMA is concerned, there are three entities involved in a DMA data transfer on the ISA bus:

8237 DMA 控制器 (DMAC)
The 8237 DMA controller (DMAC)

控制器保存有关 DMA 传输的信息,例如传输方向、内存地址和大小。它还包含一个跟踪正在进行的传输状态的计数器。当控制器接收到 DMA 请求信号时,它会获得总线的控制权并驱动信号线,以便设备可以读取或写入其数据。

The controller holds information about the DMA transfer, such as the direction, the memory address, and the size of the transfer. It also contains a counter that tracks the status of ongoing transfers. When the controller receives a DMA request signal, it gains control of the bus and drives the signal lines so that the device can read or write its data.

外围设备
The peripheral device

当设备准备好传输数据时,必须激活 DMA 请求信号。实际的传输由DMAC管理;当控制器选通设备时,硬件设备顺序地在总线上读取或写入数据。当传输结束时,设备通常会引发中断。

The device must activate the DMA request signal when it's ready to transfer data. The actual transfer is managed by the DMAC; the hardware device sequentially reads or writes data onto the bus when the controller strobes the device. The device usually raises an interrupt when the transfer is over.

设备驱动程序
The device driver

驱动程序几乎不需要做什么;它为 DMA 控制器提供传输方向、总线地址和传输大小。它还与其外设通信,让外设为传输数据做好准备,并在 DMA 结束时响应中断。

The driver has little to do; it provides the DMA controller with the direction, bus address, and size of the transfer. It also talks to its peripheral to prepare it for transferring the data and responds to the interrupt when the DMA is over.

PC 中使用的原始 DMA 控制器可以管理四个“通道”,每个通道与一组 DMA 寄存器相关联。四个设备可以同时将其 DMA 信息存储在控制器中。较新的 PC 包含相当于两个 DMAC 的设备:[ 6 ] 第二个控制器(主控制器)连接到系统处理器,第一个控制器(从控制器)连接到第二个控制器的通道 0。[ 7 ]

The original DMA controller used in the PC could manage four "channels," each associated with one set of DMA registers. Four devices could store their DMA information in the controller at the same time. Newer PCs contain the equivalent of two DMAC devices:[6] the second controller (master) is connected to the system processor, and the first (slave) is connected to channel 0 of the second controller.[7]

通道编号为 0-7:通道 4 不可用于 ISA 外设,因为它在内部用于将从控制器级联到主控制器。因此,从站上的可用通道(8 位通道)为 0-3,主站上的可用通道为 5-7(16 位通道)。任何 DMA 传输的大小(存储在控制器中)都是一个 16 位数字,表示总线周期数。因此,从控制器的最大传输大小为 64 KB(因为它在一个周期内传输 8 位),主控制器的最大传输大小为 128 KB(执行 16 位传输)。

The channels are numbered from 0-7: channel 4 is not available to ISA peripherals, because it is used internally to cascade the slave controller onto the master. The available channels are, thus, 0-3 on the slave (the 8-bit channels) and 5-7 on the master (the 16-bit channels). The size of any DMA transfer, as stored in the controller, is a 16-bit number representing the number of bus cycles. The maximum transfer size is, therefore, 64 KB for the slave controller (because it transfers eight bits in one cycle) and 128 KB for the master (which does 16-bit transfers).

由于 DMA 控制器是系统范围的资源,因此内核会帮助处理它。它使用 DMA 注册表为 DMA 通道提供请求和释放机制,并使用一组函数来配置 DMA 控制器中的通道信息。

Because the DMA controller is a system-wide resource, the kernel helps deal with it. It uses a DMA registry to provide a request-and-free mechanism for the DMA channels and a set of functions to configure channel information in the DMA controller.

注册 DMA 使用

Registering DMA usage

您应该习惯内核注册表——我们已经见过它们用于 I/O 端口和中断线。DMA 通道注册表与其他通道注册表类似。包含<asm/dma.h>后,可以使用以下函数来获取和释放 DMA 通道的所有权:

You should be used to kernel registries—we've already seen them for I/O ports and interrupt lines. The DMA channel registry is similar to the others. After <asm/dma.h> has been included, the following functions can be used to obtain and release ownership of a DMA channel:

int request_dma(unsigned int channel, const char *name); 
void free_dma(unsigned int channel);

channel 参数是 0 到 7 之间的数字,或更准确地说,是小于 MAX_DMA_CHANNELS 的正数。在 PC 上,MAX_DMA_CHANNELS 被定义为 8,以与硬件匹配。name 参数是标识设备的字符串。指定的名称会出现在文件 /proc/dma 中,用户程序可以读取该文件。

The channel argument is a number between 0 and 7 or, more precisely, a positive number less than MAX_DMA_CHANNELS. On the PC, MAX_DMA_CHANNELS is defined as 8 to match the hardware. The name argument is a string identifying the device. The specified name appears in the file /proc/dma, which can be read by user programs.

request_dma 的返回值为 0 表示成功;出现错误时返回 -EINVAL 或 -EBUSY。前者表示请求的通道超出范围,后者表示另一个设备正在占用该通道。

The return value from request_dma is 0 for success and -EINVAL or -EBUSY if there was an error. The former means that the requested channel is out of range, and the latter means that another device is holding the channel.

我们建议您像对待 I/O 端口和中断线一样对待 DMA 通道;在打开时请求通道 比从模块初始化函数中请求要好得多。延迟请求允许驱动程序之间进行一些共享;例如,您的声卡和模拟 I/O 接口可以共享 DMA 通道,只要它们不同时使用即可。

We recommend that you take the same care with DMA channels as with I/O ports and interrupt lines; requesting the channel at open time is much better than requesting it from the module initialization function. Delaying the request allows some sharing between drivers; for example, your sound card and your analog I/O interface can share the DMA channel as long as they are not used at the same time.

我们还建议您在请求中断线之后再请求 DMA 通道,并在释放中断线之前释放 DMA 通道。这是请求这两种资源的常规顺序;遵循该约定可以避免可能的死锁。请注意,每个使用 DMA 的设备还需要一条 IRQ 线;否则,它无法发出数据传输完成的信号。

We also suggest that you request the DMA channel after you've requested the interrupt line and that you release it before the interrupt. This is the conventional order for requesting the two resources; following the convention avoids possible deadlocks. Note that every device using DMA needs an IRQ line as well; otherwise, it couldn't signal the completion of data transfer.

在典型情况下,open 的代码如下所示,它针对的是我们假设的 dad 模块。如代码所示,dad 设备使用快速中断处理程序,不支持共享 IRQ 线。

In a typical case, the code for open looks like the following, which refers to our hypothetical dad module. The dad device as shown uses a fast interrupt handler without support for shared IRQ lines.

int dad_open (struct inode *inode, struct file *filp)
{
    struct dad_device *my_device; 

    /* ... */
    if ( (error = request_irq(my_device.irq, dad_interrupt,
                              SA_INTERRUPT, "dad", NULL)) )
        return error; /* or implement blocking open */

    if ( (error = request_dma(my_device.dma, "dad")) ) {
        free_irq(my_device.irq, NULL);
        return error; /* or implement blocking open */
    }
    /* ... */
    return 0;
}

与刚刚显示的open相匹配的 close实现如下所示:

The close implementation that matches the open just shown looks like this:

void dad_close (struct inode *inode, struct file *filp)
{
    struct dad_device *my_device;

    /* ... */
    free_dma(my_device.dma);
    free_irq(my_device.irq, NULL);
    /* ... */
}

以下是/proc/dma文件在安装了声卡的系统上的外观:

Here's how the /proc/dma file looks on a system with the sound card installed:

merlino% cat /proc/dma
 1: Sound Blaster8
 4: cascade

有趣的是,默认的声音驱动程序在系统启动时就获取了 DMA 通道,并且从不释放它。cascade 条目是一个占位符,表示通道 4 不可供驱动程序使用,如前文所述。

It's interesting to note that the default sound driver gets the DMA channel at system boot and never releases it. The cascade entry is a placeholder, indicating that channel 4 is not available to drivers, as explained earlier.

与 DMA 控制器对话

Talking to the DMA controller

注册后,驱动程序工作的主要部分包括配置 DMA 控制器以实现正确操作。这项任务并不简单,但幸运的是,内核导出了典型驱动程序所需的所有功能。

After registration, the main part of the driver's job consists of configuring the DMA controller for proper operation. This task is not trivial, but fortunately, the kernel exports all the functions needed by the typical driver.

驱动程序需要在 read 或 write 被调用时,或者在准备异步传输时配置 DMA 控制器。后一项任务可以在 open 时执行,也可以响应 ioctl 命令执行,具体取决于驱动程序及其实现的策略。此处显示的代码通常由 read 或 write 设备方法调用。

The driver needs to configure the DMA controller either when read or write is called, or when preparing for asynchronous transfers. This latter task is performed either at open time or in response to an ioctl command, depending on the driver and the policy it implements. The code shown here is the code that is typically called by the read or write device methods.

本小节提供 DMA 控制器内部结构的快速概述,以便您了解此处介绍的代码。如果您想了解更多信息,我们强烈建议您阅读<asm/dma.h>和一些描述 PC 架构的硬件手册。特别是,我们不处理 8 位与 16 位数据传输的问题。如果您正在为 ISA 设备板编写设备驱动程序,则应在设备的硬件手册中找到相关信息。

This subsection provides a quick overview of the internals of the DMA controller so you understand the code introduced here. If you want to learn more, we'd urge you to read <asm/dma.h> and some hardware manuals describing the PC architecture. In particular, we don't deal with the issue of 8-bit versus 16-bit data transfers. If you are writing device drivers for ISA device boards, you should find the relevant information in the hardware manuals for the devices.

DMA 控制器是一种共享资源,如果多个处理器尝试同时对其进行编程,就可能出现混乱。因此,控制器受到一个称为 dma_spin_lock 的自旋锁的保护。驱动程序不应直接操作这个锁;不过,内核提供了两个函数来替您完成:

The DMA controller is a shared resource, and confusion could arise if more than one processor attempts to program it simultaneously. For that reason, the controller is protected by a spinlock, called dma_spin_lock. Drivers should not manipulate the lock directly; however, two functions have been provided to do that for you:

unsigned long claim_dma_lock( );

获取 DMA 自旋锁。该函数还阻止本地处理器上的中断;因此,返回值是一组描述先前中断状态的标志;完成锁定后,必须将其传递给以下函数以恢复中断状态。

Acquires the DMA spinlock. This function also blocks interrupts on the local processor; therefore, the return value is a set of flags describing the previous interrupt state; it must be passed to the following function to restore the interrupt state when you are done with the lock.

void release_dma_lock(unsigned long flags);

返回DMA自旋锁并恢复之前的中断状态。

Returns the DMA spinlock and restores the previous interrupt status.

使用下面描述的函数时应持有该自旋锁。但是,在实际 I/O 期间不应持有它。驱动程序在持有自旋锁时绝不能休眠。

The spinlock should be held when using the functions described next. It should not be held during the actual I/O, however. A driver should never sleep when holding a spinlock.

必须加载到控制器中的信息由三项组成:RAM 地址、必须传输的原子项数(以字节或字为单位)以及传输方向。为此, <asm/dma.h>导出以下函数:

The information that must be loaded into the controller consists of three items: the RAM address, the number of atomic items that must be transferred (in bytes or words), and the direction of the transfer. To this end, the following functions are exported by <asm/dma.h>:

void set_dma_mode(unsigned int channel, char mode);

指示通道是否必须从设备读取 ( DMA_MODE_READ) 还是向设备写入 ( DMA_MODE_WRITE)。存在第三种模式,DMA_MODE_CASCADE用于释放总线的控制。级联是第一个控制器连接到第二个控制器顶部的方式,但它也可以由真正的 ISA 总线主控设备使用。我们不会在这里讨论总线控制。

Indicates whether the channel must read from the device (DMA_MODE_READ) or write to it (DMA_MODE_WRITE). A third mode exists, DMA_MODE_CASCADE, which is used to release control of the bus. Cascading is the way the first controller is connected to the top of the second, but it can also be used by true ISA bus-master devices. We won't discuss bus mastering here.

void set_dma_addr(unsigned int channel, unsigned int addr);

分配 DMA 缓冲区的地址。该函数将 addr 的 24 个最低有效位存储在控制器中。addr 参数必须是总线地址(参见本章前面的第 15.4.3 节)。

Assigns the address of the DMA buffer. The function stores the 24 least significant bits of addr in the controller. The addr argument must be a bus address (see the Section 15.4.3 earlier in this chapter).

void set_dma_count(unsigned int channel, unsigned int count);

指定要传输的字节数。对于 16 位通道,count 参数同样以字节为单位;在这种情况下,该数字必须是偶数。

Assigns the number of bytes to transfer. The count argument represents bytes for 16-bit channels as well; in this case, the number must be even.

除了这些函数之外,在处理 DMA 设备时还必须使用许多内务工具:

In addition to these functions, there are a number of housekeeping facilities that must be used when dealing with DMA devices:

void disable_dma(unsigned int channel);

可以在控制器内禁用某个 DMA 通道。在配置控制器之前应禁用该通道,以防止不当操作。(否则可能发生损坏,因为控制器是通过 8 位数据传输进行编程的,因此前面的函数都不是原子执行的。)

A DMA channel can be disabled within the controller. The channel should be disabled before the controller is configured to prevent improper operation. (Otherwise, corruption can occur because the controller is programmed via 8-bit data transfers and, therefore, none of the previous functions is executed atomically).

void enable_dma(unsigned int channel);

该函数告诉控制器 DMA 通道包含有效数据。

This function tells the controller that the DMA channel contains valid data.

int get_dma_residue(unsigned int channel);

驱动程序有时需要知道 DMA 传输是否已完成。该函数返回仍待传输的字节数。成功传输后返回值为 0;而在控制器工作期间,返回值是不可预测的(但不为 0)。这种不可预测性源于需要通过两次 8 位输入操作来获得 16 位的剩余量。

The driver sometimes needs to know whether a DMA transfer has been completed. This function returns the number of bytes that are still to be transferred. The return value is 0 after a successful transfer and is unpredictable (but not 0) while the controller is working. The unpredictability springs from the need to obtain the 16-bit residue through two 8-bit input operations.

void clear_dma_ff(unsigned int channel)
void clear_dma_ff(unsigned int channel)

该函数清除 DMA 触发器。触发器用于控制对 16 位寄存器的访问。通过两个连续的 8 位操作来访问寄存器,触发器用于选择最低有效字节(当清零时)或最高有效字节(当设置时)。当八位传输完毕后,触发器会自动切换;在访问 DMA 寄存器之前,程序员必须清零触发器(将其设置为已知状态)。

This function clears the DMA flip-flop. The flip-flop is used to control access to 16-bit registers. The registers are accessed by two consecutive 8-bit operations, and the flip-flop is used to select the least significant byte (when it is clear) or the most significant byte (when it is set). The flip-flop automatically toggles when eight bits have been transferred; the programmer must clear the flip-flop (to set it to a known state) before accessing the DMA registers.

使用这些函数,驱动程序可以实现如下函数来准备 DMA 传输:

Using these functions, a driver can implement a function like the following to prepare for a DMA transfer:

int dad_dma_prepare(int channel, int mode, unsigned int buf,
                    unsigned int count)
{
    unsigned long flags;

    flags = claim_dma_lock( );
    disable_dma(channel);
    clear_dma_ff(channel);
    set_dma_mode(channel, mode);
    set_dma_addr(channel, virt_to_bus(buf));
    set_dma_count(channel, count);
    enable_dma(channel);
    release_dma_lock(flags);

    return 0;
}
int dad_dma_prepare(int channel, int mode, unsigned int buf,
                    unsigned int count)
{
    unsigned long flags;

    flags = claim_dma_lock(  );
    disable_dma(channel);
    clear_dma_ff(channel);
    set_dma_mode(channel, mode);
    set_dma_addr(channel, virt_to_bus(buf));
    set_dma_count(channel, count);
    enable_dma(channel);
    release_dma_lock(flags);

    return 0;
}

然后,可以使用类似下面这样的函数来检查 DMA 是否成功完成:

Then, a function like the next one is used to check for successful completion of DMA:

int dad_dma_isdone(int channel)
{
    int residue;
    unsigned long flags = claim_dma_lock( );
    residue = get_dma_residue(channel);
    release_dma_lock(flags);
    return (residue == 0);
}
int dad_dma_isdone(int channel)
{
    int residue;
    unsigned long flags = claim_dma_lock( );
    residue = get_dma_residue(channel);
    release_dma_lock(flags);
    return (residue == 0);
}

剩下唯一要做的就是配置设备板卡。这一特定于设备的任务通常包括读取或写入几个 I/O 端口。不同设备之间差异很大。例如,某些设备希望程序员告诉硬件 DMA 缓冲区有多大,而有时驱动程序必须读取硬连线到设备中的值。配置板卡时,硬件手册是你唯一的朋友。

The only thing that remains to be done is to configure the device board. This device-specific task usually consists of reading or writing a few I/O ports. Devices differ in significant ways. For example, some devices expect the programmer to tell the hardware how big the DMA buffer is, and sometimes the driver has to read a value that is hardwired into the device. For configuring the board, the hardware manual is your only friend.

快速参考

Quick Reference

本章介绍了以下与内存处理相关的符号。

This chapter introduced the following symbols related to memory handling.

介绍材料

Introductory Material

#include <linux/mm.h>

#include <asm/page.h>
#include <linux/mm.h>

#include <asm/page.h>

大多数与内存管理相关的函数和结构都是在这些头文件中原型化和定义的。

Most of the functions and structures related to memory management are prototyped and defined in these header files.

void *_ _va(unsigned long physaddr);

unsigned long _ _pa(void *kaddr);
void *_ _va(unsigned long physaddr);

unsigned long _ _pa(void *kaddr);

在内核逻辑地址和物理地址之间转换的宏。

Macros that convert between kernel logical addresses and physical addresses.

PAGE_SIZE

PAGE_SHIFT
PAGE_SIZE

PAGE_SHIFT

这些常量给出底层硬件上页面的大小(以字节为单位),以及将页帧号转换为物理地址时必须移位的位数。

Constants that give the size (in bytes) of a page on the underlying hardware and the number of bits that a page frame number must be shifted to turn it into a physical address.

struct page
struct page

表示系统内存映射中的硬件页面的结构。

Structure that represents a hardware page in the system memory map.

struct page *virt_to_page(void *kaddr);

void *page_address(struct page *page);

struct page *pfn_to_page(int pfn);
struct page *virt_to_page(void *kaddr);

void *page_address(struct page *page);

struct page *pfn_to_page(int pfn);

在内核逻辑地址及其关联的内存映射条目之间进行转换的宏。page_address 仅适用于低端内存页,或已被显式映射的高端内存页。pfn_to_page 将页帧号转换为其关联的 struct page 指针。

Macros that convert between kernel logical addresses and their associated memory map entries. page_address works only for low-memory pages or high-memory pages that have been explicitly mapped. pfn_to_page converts a page frame number to its associated struct page pointer.

unsigned long kmap(struct page *page);

void kunmap(struct page *page);
unsigned long kmap(struct page *page);

void kunmap(struct page *page);

kmap返回映射到给定页面的内核虚拟地址,并在需要时创建映射。kunmap删除给定页面的映射。

kmap returns a kernel virtual address that is mapped to the given page, creating the mapping if need be. kunmap deletes the mapping for the given page.
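作为示例,下面的内核态草图(仅用于说明,无法在内核树之外编译)使用 kmap/kunmap 将一个可能位于高端内存的页清零;函数名 zero_page 是假设的示例名:

As an example, the kernel-side sketch below (illustrative only; it will not build outside a kernel tree) uses kmap/kunmap to zero a page that may live in high memory; the function name zero_page is hypothetical:

```c
/* Sketch (kernel-only): zero a page that may live in high memory. */
#include <linux/highmem.h>

static void zero_page(struct page *page)
{
    char *addr = (char *)kmap(page);  /* creates a mapping if needed; may sleep */
    memset(addr, 0, PAGE_SIZE);
    kunmap(page);                     /* release the temporary mapping */
}
```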

#include <linux/highmem.h>

#include <asm/kmap_types.h>

void *kmap_atomic(struct page *page, enum km_type type);

void kunmap_atomic(void *addr, enum km_type type);
#include <linux/highmem.h>

#include <asm/kmap_types.h>

void *kmap_atomic(struct page *page, enum km_type type);

void kunmap_atomic(void *addr, enum km_type type);

kmap 的高性能版本;所得到的映射只能由原子代码持有。对于驱动程序,type 应为 KM_USER0、KM_USER1、KM_IRQ0 或 KM_IRQ1。

The high-performance version of kmap; the resulting mappings can be held only by atomic code. For drivers, type should be KM_USER0, KM_USER1, KM_IRQ0, or KM_IRQ1.

struct vm_area_struct;
struct vm_area_struct;

描述 VMA 的结构。

Structure describing a VMA.

实现映射

Implementing mmap

int remap_pfn_range(struct vm_area_struct *vma, unsigned long virt_add,

unsigned long pfn, unsigned long size, pgprot_t prot);

int io_remap_page_range(struct vm_area_struct *vma, unsigned long virt_add,

unsigned long phys_add, unsigned long size, pgprot_t prot);
int remap_pfn_range(struct vm_area_struct *vma, unsigned long virt_add,

unsigned long pfn, unsigned long size, pgprot_t prot);

int io_remap_page_range(struct vm_area_struct *vma, unsigned long virt_add,

unsigned long phys_add, unsigned long size, pgprot_t prot);

位于 mmap 核心的函数。它们将从页号 pfn 所指示的物理地址开始的 size 个字节映射到虚拟地址 virt_add。与该虚拟空间关联的保护位在 prot 中指定。当目标地址位于 I/O 内存空间时,应使用 io_remap_page_range。

Functions that sit at the heart of mmap. They map size bytes of physical addresses, starting at the page number indicated by pfn to the virtual address virt_add. The protection bits associated with the virtual space are specified in prot. io_remap_page_range should be used when the target address is in I/O memory space.

struct page *vmalloc_to_page(void *vmaddr);
struct page *vmalloc_to_page(void *vmaddr);

将从vmalloc获得的内核虚拟地址转换为其对应的struct page指针。

Converts a kernel virtual address obtained from vmalloc to its corresponding struct page pointer.

实施直接 I/O

Implementing Direct I/O

int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned

long start, int len, int write, int force, struct page **pages, struct

vm_area_struct **vmas);
int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned

long start, int len, int write, int force, struct page **pages, struct

vm_area_struct **vmas);

将用户空间缓冲区锁定到内存中并返回相应 struct page 指针的函数。调用者必须持有 mm->mmap_sem。

Function that locks a user-space buffer into memory and returns the corresponding struct page pointers. The caller must hold mm->mmap_sem.

SetPageDirty(struct page *page);
SetPageDirty(struct page *page);

将给定页面标记为"脏"(已修改)、并且在释放之前需要写回其后备存储的宏。

Macro that marks the given page as "dirty" (modified) and in need of writing to its backing store before it can be freed.

void page_cache_release(struct page *page);
void page_cache_release(struct page *page);

从页面缓存中释放给定页面。

Frees the given page from the page cache.
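下面的内核态草图(仅用于说明)将 get_user_pages、SetPageDirty 和 page_cache_release 串成典型的直接 I/O 锁页/释放模式;函数名 dio_pin_pages、dio_release_pages 均为假设名称,错误处理从简:

The kernel-side sketch below (illustrative only) strings get_user_pages, SetPageDirty, and page_cache_release together into the typical direct-I/O pin/release pattern; the function names dio_pin_pages and dio_release_pages are hypothetical, and error handling is abbreviated:

```c
/* Sketch: pin a user buffer for direct I/O, then release it.
 * Uses the 2.6.10-era interfaces described in the text. */
#include <linux/mm.h>
#include <linux/pagemap.h>

static int dio_pin_pages(unsigned long uaddr, int npages,
                         struct page **pages, int write)
{
    int res;

    down_read(&current->mm->mmap_sem);   /* caller must hold mmap_sem */
    res = get_user_pages(current, current->mm, uaddr, npages,
                         write, 0 /* force */, pages, NULL);
    up_read(&current->mm->mmap_sem);
    return res;                          /* number of pages pinned, or error */
}

static void dio_release_pages(struct page **pages, int npages, int dirty)
{
    int i;

    for (i = 0; i < npages; i++) {
        if (dirty)
            SetPageDirty(pages[i]);      /* data must reach backing store */
        page_cache_release(pages[i]);
    }
}
```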

int is_sync_kiocb(struct kiocb *iocb);
int is_sync_kiocb(struct kiocb *iocb);

如果给定的 IOCB 需要同步执行,则返回非零的宏。

Macro that returns nonzero if the given IOCB requires synchronous execution.

int aio_complete(struct kiocb *iocb, long res, long res2);
int aio_complete(struct kiocb *iocb, long res, long res2);

指示异步 I/O 操作完成的函数。

Function that indicates completion of an asynchronous I/O operation.

直接内存访问

Direct Memory Access

#include <asm/io.h>

unsigned long virt_to_bus(volatile void * address);

void * bus_to_virt(unsigned long address);
#include <asm/io.h>

unsigned long virt_to_bus(volatile void * address);

void * bus_to_virt(unsigned long address);

在内核虚拟地址和总线地址之间进行转换的已过时、已弃用的函数。与外围设备通信时必须使用总线地址。

Obsolete and deprecated functions that convert between kernel virtual addresses and bus addresses. Bus addresses must be used to talk to peripheral devices.

#include <linux/dma-mapping.h>
#include <linux/dma-mapping.h>

定义通用 DMA 函数所需的头文件。

Header file required to define the generic DMA functions.

int dma_set_mask(struct device *dev, u64 mask);
int dma_set_mask(struct device *dev, u64 mask);

对于无法寻址完整 32 位范围的外设,此函数将可寻址范围告知内核,并在 DMA 可行时返回非零值。

For peripherals that cannot address the full 32-bit range, this function informs the kernel of the addressable range and returns nonzero if DMA is possible.

void *dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t

*bus_addr, int flag)

void dma_free_coherent(struct device *dev, size_t size, void *cpuaddr,

dma_handle_t bus_addr);
void *dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t

*bus_addr, int flag)

void dma_free_coherent(struct device *dev, size_t size, void *cpuaddr,

dma_handle_t bus_addr);

为将持续驱动程序生命周期的缓冲区分配和释放相干 DMA 映射。

Allocate and free coherent DMA mappings for a buffer that will last the lifetime of the driver.
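下面的内核态草图(仅用于说明)演示了一致性映射的典型分配/释放配对;其中的设备指针、缓冲区大小和函数名均为假设:

The kernel-side sketch below (illustrative only) shows the typical allocate/free pairing for a coherent mapping; the device pointer, buffer size, and function names are assumptions:

```c
/* Sketch: a coherent DMA buffer that lives as long as the driver. */
#include <linux/dma-mapping.h>

#define MYDEV_BUF_SIZE 4096   /* illustrative size */

static void *cpu_addr;        /* address the driver uses */
static dma_addr_t bus_addr;   /* address handed to the device */

static int mydev_alloc_dma(struct device *dev)
{
    cpu_addr = dma_alloc_coherent(dev, MYDEV_BUF_SIZE,
                                  &bus_addr, GFP_KERNEL);
    return cpu_addr ? 0 : -ENOMEM;
}

static void mydev_free_dma(struct device *dev)
{
    dma_free_coherent(dev, MYDEV_BUF_SIZE, cpu_addr, bus_addr);
}
```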

#include <linux/dmapool.h>

struct dma_pool *dma_pool_create(const char *name, struct device *dev,

size_t size, size_t align, size_t allocation);

void dma_pool_destroy(struct dma_pool *pool);

void *dma_pool_alloc(struct dma_pool *pool, int mem_flags, dma_addr_t

*handle);

void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t handle);
#include <linux/dmapool.h>

struct dma_pool *dma_pool_create(const char *name, struct device *dev,

size_t size, size_t align, size_t allocation);

void dma_pool_destroy(struct dma_pool *pool);

void *dma_pool_alloc(struct dma_pool *pool, int mem_flags, dma_addr_t

*handle);

void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t handle);

创建、销毁和使用 DMA 池来管理小型 DMA 区域的函数。

Functions that create, destroy, and use DMA pools to manage small DMA areas.
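作为示例,下面的内核态草图(仅用于说明,名称均为假设)用 DMA 池管理 64 字节、16 字节对齐的小描述符:

As an example, the kernel-side sketch below (illustrative only; all names are assumptions) uses a DMA pool to manage small, 64-byte, 16-byte-aligned descriptors:

```c
/* Sketch: a pool of small, aligned DMA areas (e.g., device descriptors). */
#include <linux/dmapool.h>

static struct dma_pool *desc_pool;

static int mydev_pool_setup(struct device *dev)
{
    desc_pool = dma_pool_create("mydev_desc", dev,
                                64 /* size */, 16 /* align */, 0);
    return desc_pool ? 0 : -ENOMEM;
}

static void *mydev_desc_get(dma_addr_t *handle)
{
    return dma_pool_alloc(desc_pool, GFP_KERNEL, handle);
}

static void mydev_desc_put(void *vaddr, dma_addr_t handle)
{
    dma_pool_free(desc_pool, vaddr, handle);
    /* call dma_pool_destroy(desc_pool) at driver teardown */
}
```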

enum dma_data_direction;

DMA_TO_DEVICE

DMA_FROM_DEVICE

DMA_BIDIRECTIONAL

DMA_NONE
enum dma_data_direction;

DMA_TO_DEVICE

DMA_FROM_DEVICE

DMA_BIDIRECTIONAL

DMA_NONE

用于告诉流映射函数数据移入或移出缓冲区的方向的符号。

Symbols used to tell the streaming mapping functions the direction in which data is moving to or from the buffer.

dma_addr_t dma_map_single(struct device *dev, void *buffer, size_t size, enum

dma_data_direction direction);

void dma_unmap_single(struct device *dev, dma_addr_t bus_addr, size_t size,

enum dma_data_direction direction);
dma_addr_t dma_map_single(struct device *dev, void *buffer, size_t size, enum

dma_data_direction direction);

void dma_unmap_single(struct device *dev, dma_addr_t bus_addr, size_t size,

enum dma_data_direction direction);

创建和销毁一次性流式 DMA 映射。

Create and destroy a single-use, streaming DMA mapping.
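下面的内核态草图(仅用于说明)展示了围绕一次发往设备的传输建立和撤销流式映射;mydev_start_tx 是假设的设备特定辅助函数:

The kernel-side sketch below (illustrative only) wraps a streaming mapping around one transfer to the device; mydev_start_tx is a hypothetical device-specific helper:

```c
/* Sketch: a single-use streaming mapping for an outbound transfer. */
#include <linux/dma-mapping.h>

static void mydev_send(struct device *dev, void *buffer, size_t size)
{
    dma_addr_t bus_addr;

    bus_addr = dma_map_single(dev, buffer, size, DMA_TO_DEVICE);
    /* The device now owns the buffer; do not touch it from the CPU. */
    mydev_start_tx(bus_addr, size);      /* hypothetical helper */

    /* ... later, after the device signals completion ... */
    dma_unmap_single(dev, bus_addr, size, DMA_TO_DEVICE);
}
```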

void dma_sync_single_for_cpu(struct device *dev, dma_handle_t bus_addr, size_t

size, enum dma_data_direction direction);

void dma_sync_single_for_device(struct device *dev, dma_handle_t bus_addr,

size_t size, enum dma_data_direction direction);
void dma_sync_single_for_cpu(struct device *dev, dma_handle_t bus_addr, size_t

size, enum dma_data_direction direction);

void dma_sync_single_for_device(struct device *dev, dma_handle_t bus_addr,

size_t size, enum dma_data_direction direction);

同步具有流映射的缓冲区。如果处理器必须在流映射到位时(即,当设备拥有缓冲区时)访问缓冲区,则必须使用这些函数。

Synchronizes a buffer that has a streaming mapping. These functions must be used if the processor must access a buffer while the streaming mapping is in place (i.e., while the device owns the buffer).

#include <asm/scatterlist.h>

struct scatterlist { /* ... */ };

dma_addr_t sg_dma_address(struct scatterlist *sg);

unsigned int sg_dma_len(struct scatterlist *sg);
#include <asm/scatterlist.h>

struct scatterlist { /* ... */ };

dma_addr_t sg_dma_address(struct scatterlist *sg);

unsigned int sg_dma_len(struct scatterlist *sg);

scatterlist结构描述了涉及多个缓冲区的 I/O 操作。宏 sg_dma_addresssg_dma_len可用于在实现分散/聚集操作时提取总线地址和缓冲区长度以传递给设备。

The scatterlist structure describes an I/O operation that involves more than one buffer. The macros sg_dma_address and sg_dma_len may be used to extract bus addresses and buffer lengths to pass to the device when implementing scatter/gather operations.

dma_map_sg(struct device *dev, struct scatterlist *list, int nents,

enum dma_data_direction direction);

dma_unmap_sg(struct device *dev, struct scatterlist *list, int nents, enum

dma_data_direction direction);

void dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, int

nents, enum dma_data_direction direction);

void dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, int

nents, enum dma_data_direction direction);
dma_map_sg(struct device *dev, struct scatterlist *list, int nents,

enum dma_data_direction direction);

dma_unmap_sg(struct device *dev, struct scatterlist *list, int nents, enum

dma_data_direction direction);

void dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, int

nents, enum dma_data_direction direction);

void dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, int

nents, enum dma_data_direction direction);

dma_map_sg映射分散/聚集操作, dma_unmap_sg撤消该映射。如果在映射处于活动状态时必须访问缓冲区,则可以使用dma_sync_sg_*来同步事物。

dma_map_sg maps a scatter/gather operation, and dma_unmap_sg undoes that mapping. If the buffers must be accessed while the mapping is active, dma_sync_sg_* may be used to synchronize things.
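下面的内核态草图(仅用于说明)映射一个 scatterlist,并用 sg_dma_address/sg_dma_len 访问器把每个段编程到一个假设的设备中;注意 dma_unmap_sg 使用的是最初传入的 nents,而不是 dma_map_sg 的返回值:

The kernel-side sketch below (illustrative only) maps a scatterlist and programs each segment into a hypothetical device via the sg_dma_address/sg_dma_len accessors; note that dma_unmap_sg takes the original nents, not the value returned by dma_map_sg:

```c
/* Sketch: mapping a scatter/gather list for a transfer to the device. */
#include <linux/dma-mapping.h>
#include <asm/scatterlist.h>

static void mydev_map_and_send(struct device *dev,
                               struct scatterlist *list, int nents)
{
    int i, count;

    count = dma_map_sg(dev, list, nents, DMA_TO_DEVICE);
    for (i = 0; i < count; i++) {
        /* use the accessors; entries may have been coalesced */
        mydev_program_segment(sg_dma_address(&list[i]),   /* hypothetical */
                              sg_dma_len(&list[i]));
    }
    /* ... after the transfer completes, unmap with the original nents ... */
    dma_unmap_sg(dev, list, nents, DMA_TO_DEVICE);
}
```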

/proc/dma
/proc/dma

包含 DMA 控制器中已分配通道的文本快照的文件。基于 PCI 的 DMA 未显示,因为每个板独立工作,无需在 DMA 控制器中分配通道。

File that contains a textual snapshot of the allocated channels in the DMA controllers. PCI-based DMA is not shown because each board works independently, without the need to allocate a channel in the DMA controller.

#include <asm/dma.h>
#include <asm/dma.h>

定义或原型化与 DMA 相关的所有函数和宏的标头。必须包含它才能使用以下任何符号。

Header that defines or prototypes all the functions and macros related to DMA. It must be included to use any of the following symbols.

int request_dma(unsigned int channel, const char *name);

void free_dma(unsigned int channel);
int request_dma(unsigned int channel, const char *name);

void free_dma(unsigned int channel);

访问 DMA 注册表。使用 ISA DMA 通道之前必须执行注册。

Access the DMA registry. Registration must be performed before using ISA DMA channels.

unsigned long claim_dma_lock( );

void release_dma_lock(unsigned long flags);
unsigned long claim_dma_lock( );

void release_dma_lock(unsigned long flags);

获取并释放 DMA 自旋锁,必须在调用本列表后面介绍的其他 ISA DMA 函数之前保持该自旋锁。它们还禁用和重新启用本地处理器上的中断。

Acquire and release the DMA spinlock, which must be held prior to calling the other ISA DMA functions described later in this list. They also disable and reenable interrupts on the local processor.

void set_dma_mode(unsigned int channel, char mode);

void set_dma_addr(unsigned int channel, unsigned int addr);

void set_dma_count(unsigned int channel, unsigned int count);
void set_dma_mode(unsigned int channel, char mode);

void set_dma_addr(unsigned int channel, unsigned int addr);

void set_dma_count(unsigned int channel, unsigned int count);

将 DMA 信息编程到 DMA 控制器中。addr是总线地址。

Program DMA information in the DMA controller. addr is a bus address.

void disable_dma(unsigned int channel);

void enable_dma(unsigned int channel);
void disable_dma(unsigned int channel);

void enable_dma(unsigned int channel);

配置期间必须禁用 DMA 通道。这些函数改变 DMA 通道的状态。

A DMA channel must be disabled during configuration. These functions change the status of the DMA channel.

int get_dma_residue(unsigned int channel);
int get_dma_residue(unsigned int channel);

如果驱动程序需要知道 DMA 传输是如何进行的,则可以调用此函数,该函数返回尚未完成的数据传输数。DMA成功完成后,函数返回0;数据传输时该值是不可预测的。

If the driver needs to know how a DMA transfer is proceeding, it can call this function, which returns the number of data transfers that are yet to be completed. After successful completion of DMA, the function returns 0; the value is unpredictable while data is being transferred.

void clear_dma_ff(unsigned int channel)
void clear_dma_ff(unsigned int channel)

控制器使用 DMA 触发器通过两个 8 位操作来传输 16 位值。在向控制器发送任何数据之前必须将其清除。

The DMA flip-flop is used by the controller to transfer 16-bit values by means of two 8-bit operations. It must be cleared before sending any data to the controller.




[ 1 ]许多非 x86 架构无需此处描述的内核/用户空间分割即可高效运行,因此它们可以在 32 位系统上使用高达 4 GB 的内核地址空间。但是,当安装的内存超过 4 GB 时,本节中描述的限制仍然适用于此类系统。

[1] Many non-x86 architectures are able to efficiently do without the kernel/user-space split described here, so they can work with up to a 4-GB kernel address space on 32-bit systems. The constraints described in this section still apply to such systems when more than 4 GB of memory are installed, however.

[ 2 ] 2.6 内核(带有附加补丁)可以在 x86 硬件上支持“4G/4G”模式,从而以较低的性能成本实现更大的内核和用户虚拟地址空间。

[2] The 2.6 kernel (with an added patch) can support a "4G/4G" mode on x86 hardware, which enables larger kernel and user virtual address spaces at a mild performance cost.

[ 3 ] BSS 这个名字是历史遗留物,来自一个古老的汇编操作符,意思是"由符号开始的块"(block started by symbol)。可执行文件的 BSS 段不存储在磁盘上,内核将零页映射到 BSS 地址范围。

[3] The name BSS is a historical relic from an old assembly operator meaning "block started by symbol." The BSS segment of executable files isn't stored on disk, and the kernel maps the zero page to the BSS address range.

[ 4 ]当然,凡事都有例外;有关如何使用轮询最好地实现高性能网络驱动程序的演示,请参见第 15.2.6 节。

[4] There are, of course, exceptions to everything; see Section 15.2.6 for a demonstration of how high-performance network drivers are best implemented using polling.

[ 5 ]碎片一词通常应用于磁盘,以表达文件在磁介质上不是连续存储的想法。同样的概念也适用于内存,其中每个虚拟地址空间分散在整个物理 RAM 中,并且当请求 DMA 缓冲区时很难检索连续的空闲页面。

[5] The word fragmentation is usually applied to disks to express the idea that files are not stored consecutively on the magnetic medium. The same concept applies to memory, where each virtual address space gets scattered throughout physical RAM, and it becomes difficult to retrieve consecutive free pages when a DMA buffer is requested.

[ 6 ]这些电路现在是主板芯片组的一部分,但几年前它们是两个独立的 8237 芯片。

[6] These circuits are now part of the motherboard's chipset, but a few years ago they were two separate 8237 chips.

[ 7 ]最初的 PC 只有一个控制器;第二个是在基于 286 的平台中添加的。然而,第二个控制器作为主控制器连接,因为它处理 16 位传输;第一个控制器一次仅传输 8 位,保留它是为了向后兼容。

[7] The original PCs had only one controller; the second was added in 286-based platforms. However, the second controller is connected as the master because it handles 16-bit transfers; the first transfers only eight bits at a time and is there for backward compatibility.

第 16 章. 块驱动程序

Chapter 16. Block Drivers

到目前为止,我们的讨论仅限于字符驱动程序。然而,Linux 系统中还有其他类型的驱动程序,现在是我们稍微扩大我们的关注范围的时候了。因此,本章讨论块驱动程序。

So far, our discussion has been limited to char drivers. There are other types of drivers in Linux systems, however, and the time has come for us to widen our focus somewhat. Accordingly, this chapter discusses block drivers.

块驱动程序提供对以固定大小的块(主要是磁盘驱动器)传输可随机访问数据的设备的访问。Linux 内核认为块设备与字符设备有根本的不同。因此,块驱动程序具有独特的接口和其自身的特殊挑战。

A block driver provides access to devices that transfer randomly accessible data in fixed-size blocks—disk drives, primarily. The Linux kernel sees block devices as being fundamentally different from char devices; as a result, block drivers have a distinct interface and their own particular challenges.

高效的块驱动程序对于性能至关重要,而且不仅仅是对于用户应用程序中的显式读写而言。具有虚拟内存的现代系统的工作方式,是将(但愿)不需要的数据转移到辅助存储(通常是磁盘驱动器)中。块驱动程序是核心内存和辅助存储之间的管道;因此,它们可以被视为虚拟内存子系统的一部分。虽然可以在不了解 struct page 和其他重要内存概念的情况下编写块驱动程序,但任何需要编写高性能驱动程序的人都必须利用第 15 章中介绍的材料。

Efficient block drivers are critical for performance—and not just for explicit reads and writes in user applications. Modern systems with virtual memory work by shifting (hopefully) unneeded data to secondary storage, which is usually a disk drive. Block drivers are the conduit between core memory and secondary storage; therefore, they can be seen as making up part of the virtual memory subsystem. While it is possible to write a block driver without knowing about struct page and other important memory concepts, anybody needing to write a high-performance driver has to draw upon the material covered in Chapter 15.

块层的大部分设计都以性能为中心。许多字符设备可以低于其最大速度运行,并且整个系统的性能不受影响。然而,如果其块 I/O 子系统没有很好地调整,系统就无法正常运行。Linux 块驱动程序接口允许您充分利用块设备,但必然会带来一定程度的复杂性,您必须处理这一点。令人高兴的是,2.6 块接口比旧内核中的接口有了很大改进。

Much of the design of the block layer is centered on performance. Many char devices can run below their maximum speed, and the performance of the system as a whole is not affected. The system cannot run well, however, if its block I/O subsystem is not well-tuned. The Linux block driver interface allows you to get the most out of a block device but imposes, necessarily, a degree of complexity that you must deal with. Happily, the 2.6 block interface is much improved over what was found in older kernels.

正如人们所期望的那样,本章的讨论集中于实现面向块、基于内存的设备的示例驱动程序。它本质上是一个 ramdisk。内核已经包含了一个更加优越的 ramdisk 实现,但是我们的驱动程序(称为 sbull)让我们可以演示块驱动程序的创建,同时最大限度地减少不相关的复杂性。

The discussion in this chapter is, as one would expect, centered on an example driver that implements a block-oriented, memory-based device. It is, essentially, a ramdisk. The kernel already contains a far superior ramdisk implementation, but our driver (called sbull) lets us demonstrate the creation of a block driver while minimizing unrelated complexity.

在讨论细节之前,让我们先精确定义几个术语。块(block)是固定大小的数据块,其大小由内核确定。块通常为 4096 字节,但该值可能随体系结构和所使用的具体文件系统而变化。相反,扇区(sector)是一个较小的块,其大小通常由底层硬件决定。内核期望处理实现 512 字节扇区的设备。如果您的设备使用不同的大小,内核会进行调整并避免生成硬件无法处理的 I/O 请求。然而,值得记住的是,每当内核向您提供扇区号时,它都工作在 512 字节扇区的世界中。如果您使用不同的硬件扇区大小,则必须相应地换算内核的扇区号。我们将在 sbull 驱动程序中看到这是如何完成的。

Before getting into the details, let's define a couple of terms precisely. A block is a fixed-size chunk of data, the size being determined by the kernel. Blocks are often 4096 bytes, but that value can vary depending on the architecture and the exact filesystem being used. A sector, in contrast, is a small block whose size is usually determined by the underlying hardware. The kernel expects to be dealing with devices that implement 512-byte sectors. If your device uses a different size, the kernel adapts and avoids generating I/O requests that the hardware cannot handle. It is worth keeping in mind, however, that any time the kernel presents you with a sector number, it is working in a world of 512-byte sectors. If you are using a different hardware sector size, you have to scale the kernel's sector numbers accordingly. We see how that is done in the sbull driver.

注册

Registration

块驱动程序与字符驱动程序一样,必须使用一组注册接口来使其设备可供内核使用。概念相似,但块设备注册的细节完全不同。您需要学习一套全新的数据结构和设备操作。

Block drivers, like char drivers, must use a set of registration interfaces to make their devices available to the kernel. The concepts are similar, but the details of block device registration are all different. You have a whole new set of data structures and device operations to learn.

块驱动程序注册

Block Driver Registration

大多数块驱动程序采取的第一步是向内核注册自己。该任务的函数是 register_blkdev (在<linux/fs.h>中声明):

The first step taken by most block drivers is to register themselves with the kernel. The function for this task is register_blkdev (which is declared in <linux/fs.h>):

int register_blkdev(unsigned int major, const char *name);
int register_blkdev(unsigned int major, const char *name);

参数是您的设备将使用的主设备号和关联的名称(内核将在/proc/devices中显示该名称)。如果major作为 传递 0,内核分配一个新的主设备号并将其返回给调用者。与往常一样, register_blkdev的负返回值 表示发生了错误。

The arguments are the major number that your device will be using and the associated name (which the kernel will display in /proc/devices). If major is passed as 0, the kernel allocates a new major number and returns it to the caller. As always, a negative return value from register_blkdev indicates that an error has occurred.
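下面是与文中 sbull 示例驱动程序风格一致的最小内核态草图(仅用于说明,细节可能与书中实际代码不同),演示在模块加载时动态分配主设备号:

Below is a minimal kernel-side sketch in the style of the text's sbull example driver (illustrative only; the details may differ from the book's actual code), showing dynamic major-number allocation at module load time:

```c
/* Sketch: obtain a dynamic major number for a block driver. */
#include <linux/fs.h>

static int sbull_major;

static int __init sbull_init(void)
{
    sbull_major = register_blkdev(0, "sbull");  /* 0 = allocate for us */
    if (sbull_major <= 0) {
        printk(KERN_WARNING "sbull: unable to get major number\n");
        return -EBUSY;
    }
    /* ... disk setup follows ... */
    return 0;
}

static void __exit sbull_exit(void)
{
    unregister_blkdev(sbull_major, "sbull");    /* args must match */
}
```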

取消块驱动注册对应的函数是:

The corresponding function for canceling a block driver registration is:

int unregister_blkdev(unsigned int major, const char *name);
int unregister_blkdev(unsigned int major, const char *name);

在这里,参数必须与传递给 register_blkdev 的参数匹配,否则函数返回 -EINVAL 并且不会取消注册任何内容。

Here, the arguments must match those passed to register_blkdev, or the function returns -EINVAL without unregistering anything.

在 2.6 内核中,对 register_blkdev 的调用完全是可选的。register_blkdev 执行的功能随着时间的推移不断减少;目前该调用执行的唯一任务是 (1) 根据请求分配动态主设备号,以及 (2) 在 /proc/devices 中创建一个条目。在未来的内核中,register_blkdev 可能会被完全删除。但与此同时,大多数驱动程序仍然调用它;这是传统做法。

In the 2.6 kernel, the call to register_blkdev is entirely optional. The functions performed by register_blkdev have been decreasing over time; the only tasks performed by this call at this point are (1) allocating a dynamic major number if requested, and (2) creating an entry in /proc/devices. In future kernels, register_blkdev may be removed altogether. Meanwhile, however, most drivers still call it; it's traditional.

磁盘注册

Disk Registration

虽然 register_blkdev 可用于获取主设备号,但它不会使任何磁盘驱动器可供系统使用。您必须使用一个单独的注册接口来管理各个驱动器。使用这个接口需要熟悉一对新的结构,所以我们就从这里开始。

While register_blkdev can be used to obtain a major number, it does not make any disk drives available to the system. There is a separate registration interface that you must use to manage individual drives. Using this interface requires familiarity with a pair of new structures, so that is where we start.

块设备操作

Block device operations

字符设备通过 file_operations 结构使它们的操作可供系统使用。块设备使用类似的结构;它是在 <linux/fs.h> 中声明的 struct block_device_operations。以下是该结构中各字段的简要概述;当我们深入 sbull 驱动程序的细节时,会更详细地重新审视它们:

Char devices make their operations available to the system by way of the file_operations structure. A similar structure is used with block devices; it is struct block_device_operations, which is declared in <linux/fs.h>. The following is a brief overview of the fields found in this structure; we revisit them in more detail when we get into the details of the sbull driver:

int (*open)(struct inode *inode, struct file *filp);

int (*release)(struct inode *inode, struct file *filp);
int (*open)(struct inode *inode, struct file *filp);

int (*release)(struct inode *inode, struct file *filp);

与字符驱动程序中的对应函数工作方式相同;每当设备被打开和关闭时就会调用它们。块驱动程序可以通过使驱动器旋转起来、锁住仓门(对于可移动介质)等方式来响应 open 调用。如果您将介质锁定在设备中,当然应该在 release 方法中将其解锁。

Functions that work just like their char driver equivalents; they are called whenever the device is opened and closed. A block driver might respond to an open call by spinning up the device, locking the door (for removable media), etc. If you lock media into the device, you should certainly unlock it in the release method.

int (*ioctl)(struct inode *inode, struct file *filp, unsigned int cmd,

unsigned long arg);
int (*ioctl)(struct inode *inode, struct file *filp, unsigned int cmd,

unsigned long arg);

实现ioctl系统调用的方法。然而,块层首先拦截大量标准请求;所以大多数块驱动程序ioctl方法都相当短。

Method that implements the ioctl system call. The block layer first intercepts a large number of standard requests, however; so most block driver ioctl methods are fairly short.

int (*media_changed) (struct gendisk *gd);
int (*media_changed) (struct gendisk *gd);

内核调用的方法,用于检查用户是否更改了驱动器中的介质,如果更改则返回非零值。显然,此方法仅适用于支持可移动介质的驱动器(并且足够智能,可以为驱动程序提供“介质已更改”标志);其他情况下可以省略。

Method called by the kernel to check whether the user has changed the media in the drive, returning a nonzero value if so. Obviously, this method is only applicable to drives that support removable media (and that are smart enough to make a "media changed" flag available to the driver); it can be omitted in other cases.

struct gendisk 参数是内核表示单个磁盘的方式;我们将在下一节中研究该结构。

The struct gendisk argument is how the kernel represents a single disk; we will be looking at that structure in the next section.

int (*revalidate_disk) (struct gendisk *gd);
int (*revalidate_disk) (struct gendisk *gd);

revalidate_disk 方法被调用以响应介质更换;它使驱动程序有机会执行使新介质可供使用所需的任何工作。该函数返回一个 int 值,但该值被内核忽略。

The revalidate_disk method is called in response to a media change; it gives the driver a chance to perform whatever work is required to make the new media ready for use. The function returns an int value, but that value is ignored by the kernel.

struct module *owner;
struct module *owner;

指向拥有该结构的模块的指针;通常应将其初始化为THIS_MODULE.

A pointer to the module that owns this structure; it should usually be initialized to THIS_MODULE.

细心的读者可能已经注意到这个列表中一个有趣的遗漏:没有实际读取或写入数据的函数。在块 I/O 子系统中,这些操作由请求函数处理,该函数值得用很大一部分来讨论,并将在本章后面讨论。在我们讨论服务请求之前,我们必须完成对磁盘注册的讨论。

Attentive readers may have noticed an interesting omission from this list: there are no functions that actually read or write data. In the block I/O subsystem, these operations are handled by the request function, which deserves a large section of its own and is discussed later in the chapter. Before we can talk about servicing requests, we must complete our discussion of disk registration.

gendisk结构

The gendisk structure

struct gendisk (在<linux/genhd.h>中声明)是内核对单个磁盘设备的表示。事实上,内核也使用gendisk结构体来表示分区,但驱动程序作者不需要意识到这一点。其中有几个字段struct gendisk必须由块驱动程序初始化:

struct gendisk (declared in <linux/genhd.h>) is the kernel's representation of an individual disk device. In fact, the kernel also uses gendisk structures to represent partitions, but driver authors need not be aware of that. There are several fields in struct gendisk that must be initialized by a block driver:

int major;

int first_minor;

int minors;
int major;

int first_minor;

int minors;

描述磁盘所使用的设备号的字段。驱动器至少必须使用一个次设备号。但是,如果您的驱动器是可分区的(大多数情况下应该是),您还需要为每个可能的分区分配一个次设备号。minors 的常见值为 16,它允许一个"全磁盘"设备和 15 个分区。某些磁盘驱动程序为每个设备使用 64 个次设备号。

Fields that describe the device number(s) used by the disk. At a minimum, a drive must use at least one minor number. If your drive is to be partitionable, however (and most should be), you want to allocate one minor number for each possible partition as well. A common value for minors is 16, which allows for the "full disk" device and 15 partitions. Some disk drivers use 64 minor numbers for each device.

char disk_name[32];
char disk_name[32];

应设置为磁盘设备名称的字段。它显示在 /proc/partitions和 sysfs 中。

Field that should be set to the name of the disk device. It shows up in /proc/partitions and sysfs.

struct block_device_operations *fops;
struct block_device_operations *fops;

上一节中的一组设备操作。

Set of device operations from the previous section.

struct request_queue *queue;
struct request_queue *queue;

内核用来管理该设备的 I/O 请求的结构;我们将在第 16.3 节中对其进行研究。

Structure used by the kernel to manage I/O requests for this device; we examine it in Section 16.3.

int flags;
int flags;

一组(很少使用的)描述驱动器状态的标志。如果您的设备有可移动媒体,您应该设置GENHD_FL_REMOVABLE。CD-ROM 驱动器可以设置GENHD_FL_CD。如果由于某种原因,您不希望分区信息显示在/proc/partitions中,请设置GENHD_FL_SUPPRESS_PARTITION_INFO

A (little-used) set of flags describing the state of the drive. If your device has removable media, you should set GENHD_FL_REMOVABLE. CD-ROM drives can set GENHD_FL_CD. If, for some reason, you do not want partition information to show up in /proc/partitions, set GENHD_FL_SUPPRESS_PARTITION_INFO.

sector_t capacity;
sector_t capacity;

该驱动器的容量(以 512 字节扇区为单位)。该sector_t类型可以是 64 位宽。驱动程序不应直接设置该字段;相反,将扇区数传递给 set_capacity

The capacity of this drive, in 512-byte sectors. The sector_t type can be 64 bits wide. Drivers should not set this field directly; instead, pass the number of sectors to set_capacity.

void *private_data;
void *private_data;

块驱动程序可以使用该字段作为指向其自己的内部数据的指针。

Block drivers may use this field for a pointer to their own internal data.

内核提供了一小组用于处理gendisk结构的函数。我们在这里介绍它们,然后看看sbull如何 使用它们来使其磁盘设备可供系统使用。

The kernel provides a small set of functions for working with gendisk structures. We introduce them here, then see how sbull uses them to make its disk devices available to the system.

struct gendisk 是一个动态分配的结构,需要特殊的内核操作来初始化;驱动程序不能自行分配该结构。相反,您必须调用:

struct gendisk is a dynamically allocated structure that requires special kernel manipulation to be initialized; drivers cannot allocate the structure on their own. Instead, you must call:

struct gendisk *alloc_disk(int minors);
struct gendisk *alloc_disk(int minors);

minors 参数应该是该磁盘使用的次设备号的数量;请注意,您以后无法更改 minors 字段并期望一切正常工作。

The minors argument should be the number of minor numbers this disk uses; note that you cannot change the minors field later and expect things to work properly.

当不再需要磁盘时,应将其 释放:

When a disk is no longer needed, it should be freed with:

void del_gendisk(struct gendisk *gd);
void del_gendisk(struct gendisk *gd);

gendisk 是一个引用计数结构(它包含一个 kobject)。有 get_disk 和 put_disk 函数可用于操作引用计数,但驱动程序应该永远不需要这样做。通常,对 del_gendisk 的调用会删除对 gendisk 的最后一个引用,但并不能保证这一点。因此,在调用 del_gendisk 之后,该结构可能继续存在(并且您的方法可能被调用)。但是,如果在没有用户时删除该结构(即在最后一次 release 之后,或在模块清理函数中),您就可以确定不会再收到它的消息了。

A gendisk is a reference-counted structure (it contains a kobject). There are get_disk and put_disk functions available to manipulate the reference count, but drivers should never need to do that. Normally, the call to del_gendisk removes the final reference to a gendisk, but there are no guarantees of that. Thus, it is possible that the structure could continue to exist (and your methods could be called) after a call to del_gendisk. If you delete the structure when there are no users (that is, after the final release or in your module cleanup function), however, you can be sure that you will not hear from it again.
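
上述引用计数的生命周期可以用一个用户空间模型来演示(这只是示意,并非内核实现):当del_gendisk之后仍有其他引用存在时,结构会继续存活,直到最后一个引用被释放。

The reference-counting lifetime described above can be demonstrated with a userspace model (an illustrative sketch, not the kernel implementation): after del_gendisk, the structure stays alive until the last reference is dropped.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace model of a reference-counted disk object (not the kernel API). */
struct fake_disk {
    int refcount;
    int alive;   /* cleared by fake_del_gendisk */
};

static struct fake_disk *fake_alloc_disk(void)
{
    struct fake_disk *d = malloc(sizeof *d);
    if (d) {
        d->refcount = 1;
        d->alive = 1;
    }
    return d;
}

static void fake_get_disk(struct fake_disk *d)
{
    d->refcount++;
}

/* Returns 1 if the structure was actually freed. */
static int fake_put_disk(struct fake_disk *d)
{
    if (--d->refcount == 0) {
        free(d);
        return 1;
    }
    return 0;
}

/* Marks the disk dead and drops one reference; usually, but not
 * necessarily, the final one. */
static int fake_del_gendisk(struct fake_disk *d)
{
    d->alive = 0;
    return fake_put_disk(d);
}
```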

分配一个gendisk 结构不会使磁盘可供系统使用。为此,您必须初始化该结构并调用 add_disk

Allocating a gendisk structure does not make the disk available to the system. To do that, you must initialize the structure and call add_disk:

void add_disk(struct gendisk *gd);
void add_disk(struct gendisk *gd);

这里请记住一件重要的事情:一旦您调用 add_disk,磁盘就处于“活动”状态,并且可以随时调用其方法。事实上,第一次此类调用可能会在 add_disk返回之前发生;内核将读取前几个块以尝试找到分区表。 因此,在驱动程序完全初始化并准备好响应该磁盘上的请求之前,不应调用add_disk 。

Keep one important thing in mind here: as soon as you call add_disk, the disk is "live" and its methods can be called at any time. In fact, the first such calls will probably happen even before add_disk returns; the kernel will read the first few blocks in an attempt to find a partition table. So you should not call add_disk until your driver is completely initialized and ready to respond to requests on that disk.

sbull 中的初始化

Initialization in sbull

是时候来看一些例子了。sbull驱动程序(可从 O'Reilly 的 FTP 站点与其余示例源码一起获得)实现了一组内存中的虚拟磁盘驱动器。对于每个驱动器,sbull分配(为简单起见,使用vmalloc)一个内存数组;然后它通过块操作使该数组可用。可以通过对虚拟设备分区、在其上构建文件系统并将其挂载到系统层次结构中来测试sbull驱动程序。

It is time to get down to some examples. The sbull driver (available from O'Reilly's FTP site with the rest of the example source) implements a set of in-memory virtual disk drives. For each drive, sbull allocates (with vmalloc, for simplicity) an array of memory; it then makes that array available via block operations. The sbull driver can be tested by partitioning the virtual device, building filesystems on it, and mounting it in the system hierarchy.

与我们的其他示例驱动程序一样,sbull允许在编译或模块加载时指定主编号。如果未指定数量,则动态分配一个。由于动态分配需要调用register_blkdev ,因此sbull会这样做:

Like our other example drivers, sbull allows a major number to be specified at compile or module load time. If no number is specified, one is allocated dynamically. Since a call to register_blkdev is required for dynamic allocation, sbull does so:

sbull_major = register_blkdev(sbull_major, "sbull");
if (sbull_major <= 0) {
    printk(KERN_WARNING "sbull: unable to get major number\n");
    return -EBUSY;
 }
sbull_major = register_blkdev(sbull_major, "sbull");
if (sbull_major <= 0) {
    printk(KERN_WARNING "sbull: unable to get major number\n");
    return -EBUSY;
 }

此外,与我们在本书中介绍的其他虚拟设备一样, sbull设备由内部结构描述:

Also, like the other virtual devices we have presented in this book, the sbull device is described by an internal structure:

struct sbull_dev {
        int size;                       /* Device size in sectors */
        u8 *data;                       /* The data array */
        short users;                    /* How many users */
        short media_change;             /* Flag a media change? */
        spinlock_t lock;                /* For mutual exclusion */
        struct request_queue *queue;    /* The device request queue */
        struct gendisk *gd;             /* The gendisk structure */
        struct timer_list timer;        /* For simulated media changes */
};
struct sbull_dev {
        int size;                       /* Device size in sectors */
        u8 *data;                       /* The data array */
        short users;                    /* How many users */
        short media_change;             /* Flag a media change? */
        spinlock_t lock;                /* For mutual exclusion */
        struct request_queue *queue;    /* The device request queue */
        struct gendisk *gd;             /* The gendisk structure */
        struct timer_list timer;        /* For simulated media changes */
};

需要几个步骤来初始化该结构并使相关设备可供系统使用。我们从底层内存的基本初始化和分配开始:

Several steps are required to initialize this structure and make the associated device available to the system. We start with basic initialization and allocation of the underlying memory:

memset(dev, 0, sizeof(struct sbull_dev));
dev->size = nsectors*hardsect_size;
dev->data = vmalloc(dev->size);
if (dev->data == NULL) {
    printk (KERN_NOTICE "vmalloc failure.\n");
    return;
}
spin_lock_init(&dev->lock);
memset (dev, 0, sizeof (struct sbull_dev));
dev->size = nsectors*hardsect_size;
dev->data = vmalloc(dev->size);
if (dev->data == NULL) {
    printk (KERN_NOTICE "vmalloc failure.\n");
    return;
}
spin_lock_init(&dev->lock);

在下一步(即请求队列的分配)之前分配并初始化自旋锁非常重要。当我们讲到请求处理时,会更详细地了解这个过程;现在,只需知道必要的调用是:

It's important to allocate and initialize a spinlock before the next step, which is the allocation of the request queue. We look at this process in more detail when we get to request processing; for now, suffice it to say that the necessary call is:

dev->queue = blk_init_queue(sbull_request, &dev->lock);
dev->queue = blk_init_queue(sbull_request, &dev->lock);

在这里,sbull_request 是我们的 请求函数——实际执行块读写请求的函数。当我们分配请求队列时,我们必须提供一个自旋锁来控制对该队列的访问。锁是由驱动程序而不是内核的一般部分提供的,因为请求队列和其他驱动程序数据结构通常落入同一临界区;它们往往被一起访问。与任何分配内存的函数一样,blk_init_queue可能会失败,因此您必须在继续之前检查返回值。

Here, sbull_request is our request function—the function that actually performs block read and write requests. When we allocate a request queue, we must provide a spinlock that controls access to that queue. The lock is provided by the driver rather than the general parts of the kernel because, often, the request queue and other driver data structures fall within the same critical section; they tend to be accessed together. As with any function that allocates memory, blk_init_queue can fail, so you must check the return value before continuing.

一旦我们有了设备内存和请求队列,我们就可以分配、初始化和安装相应的gendisk结构。完成这项工作的代码是:

Once we have our device memory and request queue in place, we can allocate, initialize, and install the corresponding gendisk structure. The code that does this work is:

dev->gd = alloc_disk(SBULL_MINORS);
if (!dev->gd) {
    printk (KERN_NOTICE "alloc_disk failure\n");
    goto out_vfree;
}
dev->gd->major = sbull_major;
dev->gd->first_minor = which*SBULL_MINORS;
dev->gd->fops = &sbull_ops;
dev->gd->queue = dev->queue;
dev->gd->private_data = dev;
snprintf (dev->gd->disk_name, 32, "sbull%c", which + 'a');
set_capacity(dev->gd, nsectors*(hardsect_size/KERNEL_SECTOR_SIZE));
add_disk(dev->gd);
dev->gd = alloc_disk(SBULL_MINORS);
if (! dev->gd) {
    printk (KERN_NOTICE "alloc_disk failure\n");
    goto out_vfree;
}
dev->gd->major = sbull_major;
dev->gd->first_minor = which*SBULL_MINORS;
dev->gd->fops = &sbull_ops;
dev->gd->queue = dev->queue;
dev->gd->private_data = dev;
snprintf (dev->gd->disk_name, 32, "sbull%c", which + 'a');
set_capacity(dev->gd, nsectors*(hardsect_size/KERNEL_SECTOR_SIZE));
add_disk(dev->gd);

这里,SBULL_MINORS是每个sbull设备支持的次设备号的数量。当我们为每个设备设置第一个次设备号时,我们必须考虑先前设备占用的所有号码。磁盘的名称设置为:第一个为sbulla,第二个为sbullb,依此类推。然后,用户空间可以添加分区号,因此第二个设备上的第三个分区可能是/dev/sbullb3。

Here, SBULL_MINORS is the number of minor numbers each sbull device supports. When we set the first minor number for each device, we must take into account all of the numbers taken by prior devices. The name of the disk is set such that the first one is sbulla, the second sbullb, and so on. User space can then add partition numbers so that the third partition on the second device might be /dev/sbullb3.
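
上面描述的编号和命名方案可以草绘如下(sbull_first_minor和sbull_disk_name是为演示而虚构的辅助函数,并非sbull源码的一部分):

The numbering and naming scheme described above can be sketched as follows (sbull_first_minor and sbull_disk_name are made-up helpers for illustration, not part of the sbull source):

```c
#include <stdio.h>
#include <string.h>

#define SBULL_MINORS 16   /* minor numbers per sbull device, as in the example */

/* Device 'which' starts at minor which*SBULL_MINORS, leaving room for
 * the numbers taken by prior devices. */
static int sbull_first_minor(int which)
{
    return which * SBULL_MINORS;
}

/* Disks are named sbulla, sbullb, ... by offsetting from 'a'. */
static void sbull_disk_name(int which, char *buf, size_t len)
{
    snprintf(buf, len, "sbull%c", which + 'a');
}
```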

一切设置完毕后,我们将调用add_disk来完成。当add_disk返回时,我们的几个方法很可能已经被调用 ,因此我们小心地将该调用作为设备初始化的最后一步。

Once everything is set up, we finish with a call to add_disk. Chances are that several of our methods will have been called for that disk by the time add_disk returns, so we take care to make that call the very last step in the initialization of our device.

关于扇区大小的注释

A Note on Sector Sizes

正如我们之前提到的,内核将每个磁盘视为 512 字节扇区的线性阵列。然而,并非所有硬件都使用该扇区大小。让具有不同扇区大小的设备工作并不特别困难;只需要注意一些细节即可。sbull设备导出一个hardsect_size参数,可以用来改变设备的“硬件”扇区大小;通过查看其实现,您可以了解如何将此类支持添加到您自己的驱动程序中。

As we have mentioned before, the kernel treats every disk as a linear array of 512-byte sectors. Not all hardware uses that sector size, however. Getting a device with a different sector size to work is not particularly hard; it is just a matter of taking care of a few details. The sbull device exports a hardsect_size parameter that can be used to change the "hardware" sector size of the device; by looking at its implementation, you can see how to add this sort of support to your own drivers.

第一个细节是通知内核您的设备支持的扇区大小。硬件扇区大小是请求队列中的参数,而不是gendisk结构中的参数。在队列分配之后,立即通过调用blk_queue_hardsect_size来设置该大小:

The first of those details is to inform the kernel of the sector size your device supports. The hardware sector size is a parameter in the request queue, rather than in the gendisk structure. This size is set with a call to blk_queue_hardsect_size immediately after the queue is allocated:

blk_queue_hardsect_size(dev->queue, hardsect_size);
blk_queue_hardsect_size(dev->queue, hardsect_size);

完成后,内核将遵循设备的硬件扇区大小。所有 I/O 请求都在硬件扇区的开头正确对齐,并且每个请求的长度都是扇区的整数倍。然而,您必须记住,内核总是以 512 字节扇区来表达;因此,有必要相应地换算所有扇区号。例如,当sbull在其gendisk结构中设置设备的容量时,调用如下所示:

Once that is done, the kernel adheres to your device's hardware sector size. All I/O requests are properly aligned at the beginning of a hardware sector, and the length of each request is an integral number of sectors. You must remember, however, that the kernel always expresses itself in 512-byte sectors; thus, it is necessary to translate all sector numbers accordingly. So, for example, when sbull sets the capacity of the device in its gendisk structure, the call looks like:

set_capacity(dev->gd, nsectors*(hardsect_size/KERNEL_SECTOR_SIZE));
set_capacity(dev->gd, nsectors*(hardsect_size/KERNEL_SECTOR_SIZE));

KERNEL_SECTOR_SIZE是一个本地定义的常量,我们用它在内核的 512 字节扇区和我们被告知使用的任何大小之间进行换算。当我们查看sbull的请求处理逻辑时,会经常遇到这样的计算。

KERNEL_SECTOR_SIZE is a locally-defined constant that we use to scale between the kernel's 512-byte sectors and whatever size we have been told to use. This sort of calculation pops up frequently as we look at the sbull request processing logic.
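
这种换算可以用两个小的用户空间辅助函数来概括(函数名为演示而虚构,并假设hardsect_size是 512 的整数倍):

The scaling can be summed up with two small userspace helpers (the names are invented for illustration, and hardsect_size is assumed to be a multiple of 512):

```c
#define KERNEL_SECTOR_SIZE 512

/* Capacity, in kernel (512-byte) sectors, to report via set_capacity
 * for a device with nsectors hardware sectors. */
static unsigned long kernel_sectors(unsigned long nsectors,
                                    unsigned hardsect_size)
{
    return nsectors * (hardsect_size / KERNEL_SECTOR_SIZE);
}

/* Translate a kernel (512-byte) sector number into a hardware sector
 * number, scaling down by the same ratio. */
static unsigned long hw_sector(unsigned long kernel_sector,
                               unsigned hardsect_size)
{
    return kernel_sector / (hardsect_size / KERNEL_SECTOR_SIZE);
}
```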

块设备操作

The Block Device Operations

我们在上一节简单介绍了block_device_operations结构。现在,在进入请求处理之前,我们花一些时间更详细地了解这些操作。为此,是时候提一下sbull驱动程序的另一个功能了:它伪装成一个可移动设备。每当最后一个用户关闭设备时,就会设置一个 30 秒的计时器;如果在此期间设备没有被打开,则设备的内容将被清除,并且内核将被告知介质已更改。例如,30 秒的延迟为用户提供了在创建文件系统后挂载sbull设备的时间。

We had a brief introduction to the block_device_operations structure in the previous section. Now we take some time to look at these operations in a bit more detail before getting into request processing. To that end, it is time to mention one other feature of the sbull driver: it pretends to be a removable device. Whenever the last user closes the device, a 30-second timer is set; if the device is not opened during that time, the contents of the device are cleared, and the kernel will be told that the media has been changed. The 30-second delay gives the user time to, for example, mount an sbull device after creating a filesystem on it.
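
这种“最后一次关闭后延迟失效”的逻辑可以用一个用户空间模型来草绘;时间以参数显式传入,以便在不真正等待 30 秒的情况下测试该逻辑(这只是示意,并非sbull的实际代码):

The "invalidate some time after the last close" logic can be sketched with a userspace model; time is passed in explicitly so the logic can be exercised without actually waiting 30 seconds (a sketch, not sbull's actual code):

```c
#define INVALIDATE_DELAY 30   /* seconds, as in sbull */

/* Userspace model of the simulated media removal. */
struct fake_media {
    int users;
    long deadline;       /* 0 = no timer armed */
    int media_change;
};

/* The last close arms the timer by recording a deadline. */
static void media_close(struct fake_media *m, long now)
{
    if (--m->users == 0)
        m->deadline = now + INVALIDATE_DELAY;
}

/* An open cancels the timer; if the deadline already passed, the
 * "media" is considered changed. */
static void media_open(struct fake_media *m, long now)
{
    if (m->deadline && now >= m->deadline)
        m->media_change = 1;
    m->deadline = 0;
    m->users++;
}
```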

open 和release 方法

The open and release Methods

为了实现模拟的介质移除,sbull必须知道最后一个用户何时关闭了设备。用户计数由驱动程序维护。open和close方法的工作就是保持该计数最新。

To implement the simulated media removal, sbull must know when the last user has closed the device. A count of users is maintained by the driver. It is the job of the open and close methods to keep that count current.

open方法看起来与它的字符驱动程序等效方法非常相似;它以相关的inode和file结构指针作为参数。当inode引用块设备时,i_bdev->bd_disk字段包含指向关联gendisk结构的指针;该指针可用于获取设备驱动程序的内部数据结构。事实上,这就是sbull的open方法所做的第一件事:

The open method looks very similar to its char-driver equivalent; it takes the relevant inode and file structure pointers as arguments. When an inode refers to a block device, the field i_bdev->bd_disk contains a pointer to the associated gendisk structure; this pointer can be used to get to a driver's internal data structures for the device. That is, in fact, the first thing that the sbull open method does:

static int sbull_open(struct inode *inode, struct file *filp)
{
    struct sbull_dev *dev = inode->i_bdev->bd_disk->private_data;

    del_timer_sync(&dev->timer);
    filp->private_data = dev;
    spin_lock(&dev->lock);
    if (!dev->users)
        check_disk_change(inode->i_bdev);
    dev->users++;
    spin_unlock(&dev->lock);
    return 0;
}
static int sbull_open(struct inode *inode, struct file *filp)
{
    struct sbull_dev *dev = inode->i_bdev->bd_disk->private_data;

    del_timer_sync(&dev->timer);
    filp->private_data = dev;
    spin_lock(&dev->lock);
    if (! dev->users) 
        check_disk_change(inode->i_bdev);
    dev->users++;
    spin_unlock(&dev->lock);
    return 0;
}

一旦sbull_open拥有了设备结构指针,它就会调用del_timer_sync来删除“介质移除”计时器(如果有的话)。请注意,我们直到计时器被删除之后才获取设备自旋锁;否则,如果计时器函数在我们删除它之前运行,就会导致死锁。设备锁定后,我们调用一个名为check_disk_change的内核函数来检查是否发生了介质更改。有人可能会争辩说应该由内核来进行该调用,但标准模式是由驱动程序在打开时处理它。

Once sbull_open has its device structure pointer, it calls del_timer_sync to remove the "media removal" timer, if any is active. Note that we do not lock the device spinlock until after the timer has been deleted; doing otherwise invites deadlock if the timer function runs before we can delete it. With the device locked, we call a kernel function called check_disk_change to check whether a media change has happened. One might argue that the kernel should make that call, but the standard pattern is for drivers to handle it at open time.

最后一步是增加用户计数并返回。

The last step is to increment the user count and return.

相反,release方法的任务是减少用户计数,并在有指示时启动媒体删除计时器:

The task of the release method is, in contrast, to decrement the user count and, if indicated, start the media removal timer:

static int sbull_release(struct inode *inode, struct file *filp)
{
    struct sbull_dev *dev = inode->i_bdev->bd_disk->private_data;

    spin_lock(&dev->lock);
    dev->users--;

    if (!dev->users) {
        dev->timer.expires = jiffies + INVALIDATE_DELAY;
        add_timer(&dev->timer);
    }
    spin_unlock(&dev->lock);

    return 0;
}
static int sbull_release(struct inode *inode, struct file *filp)
{
    struct sbull_dev *dev = inode->i_bdev->bd_disk->private_data;

    spin_lock(&dev->lock);
    dev->users--;

    if (!dev->users) {
        dev->timer.expires = jiffies + INVALIDATE_DELAY;
        add_timer(&dev->timer);
    }
    spin_unlock(&dev->lock);

    return 0;
}

在处理真实硬件设备的驱动程序中,打开释放方法将相应地设置驱动程序和硬件的状态。这项工作可能涉及向上或向下旋转磁盘、锁定可移动设备的门、分配 DMA 缓冲区等。

In a driver that handles a real, hardware device, the open and release methods would set the state of the driver and hardware accordingly. This work could involve spinning the disk up or down, locking the door of a removable device, allocating DMA buffers, etc.

您可能想知道究竟是谁打开了块设备。有一些操作会导致块设备直接从用户空间打开;这些操作包括对磁盘分区、在分区上构建文件系统或运行文件系统检查程序。当分区被挂载时,块驱动程序也会看到一个open调用。在这种情况下,没有用户空间进程持有该设备的打开文件描述符;打开的文件由内核自身持有。块驱动程序无法区分挂载操作(从内核空间打开设备)与调用mkfs之类的实用程序(从用户空间打开设备)。

You may be wondering who actually opens a block device. There are some operations that cause a block device to be opened directly from user space; these include partitioning a disk, building a filesystem on a partition, or running a filesystem checker. A block driver also sees an open call when a partition is mounted. In this case, there is no user-space process holding an open file descriptor for the device; the open file is, instead, held by the kernel itself. A block driver cannot tell the difference between a mount operation (which opens the device from kernel space) and the invocation of a utility such as mkfs (which opens it from user space).

支持可移动媒体

Supporting Removable Media

block_device_operations 结构包括两种支持可移动介质的方法。如果您正在为不可移动设备编写驱动程序,则可以安全地忽略这些方法。它们的实现相对简单。

The block_device_operations structure includes two methods for supporting removable media. If you are writing a driver for a nonremovable device, you can safely omit these methods. Their implementation is relatively straightforward.

media_changed方法被调用(从check_disk_change中)以查看介质是否已更改;如果已更改,它应该返回一个非零值。sbull的实现很简单;它查询一个标志,该标志在介质移除计时器到期时被设置:

The media_changed method is called (from check_disk_change) to see whether the media has been changed; it should return a nonzero value if this has happened. The sbull implementation is simple; it queries a flag that has been set if the media removal timer has expired:

int sbull_media_changed(struct gendisk *gd)
{
    struct sbull_dev *dev = gd->private_data;
    
    return dev->media_change;
}
int sbull_media_changed(struct gendisk *gd)
{
    struct sbull_dev *dev = gd->private_data;
    
    return dev->media_change;
}

revalidate方法在介质更改后被调用;它的工作是做好一切必要的准备,让驱动程序可以对新介质(如果有)进行操作。在调用revalidate之后,内核会尝试重新读取分区表并重新开始使用该设备。sbull的实现只是重置media_change标志并将设备内存清零,以模拟插入一张空白磁盘。

The revalidate method is called after a media change; its job is to do whatever is required to prepare the driver for operations on the new media, if any. After the call to revalidate, the kernel attempts to reread the partition table and start over with the device. The sbull implementation simply resets the media_change flag and zeroes out the device memory to simulate the insertion of a blank disk.

int sbull_revalidate(struct gendisk *gd)
{
    struct sbull_dev *dev = gd->private_data;
    
    if (dev->media_change) {
        dev->media_change = 0;
        memset (dev->data, 0, dev->size);
    }
    return 0;
}
int sbull_revalidate(struct gendisk *gd)
{
    struct sbull_dev *dev = gd->private_data;
    
    if (dev->media_change) {
        dev->media_change = 0;
        memset (dev->data, 0, dev->size);
    }
    return 0;
}

ioctl 方法

The ioctl Method

块设备可以提供ioctl 方法来执行设备控制功能。 然而,更高级别的块子系统代码会在您的驱动程序看到它们之前拦截许多ioctl命令(有关完整集,请参阅内核源代码中的drivers/block/ioctl.c )。 事实上,现代块驱动程序可能根本不需要实现很多ioctl命令。

Block devices can provide an ioctl method to perform device control functions. The higher-level block subsystem code intercepts a number of ioctl commands before your driver ever gets to see them, however (see drivers/block/ioctl.c in the kernel source for the full set). In fact, a modern block driver may not have to implement very many ioctl commands at all.

sbull的ioctl方法仅处理一个命令,即对设备几何参数(geometry)的请求:

The sbull ioctl method handles only one command—a request for the device's geometry:

int sbull_ioctl (struct inode *inode, struct file *filp,
                 unsigned int cmd, unsigned long arg)
{
    long size;
    struct hd_geometry geo;
    struct sbull_dev *dev = filp->private_data;

    switch(cmd) {
        case HDIO_GETGEO:
        /*
         * Get geometry: since we are a virtual device, we have to make
         * up something plausible.  So we claim 16 sectors, four heads,
         * and calculate the corresponding number of cylinders.  We set the
         * start of data at sector four.
         */
        size = dev->size*(hardsect_size/KERNEL_SECTOR_SIZE);
        geo.cylinders = (size & ~0x3f) >> 6;
        geo.heads = 4;
        geo.sectors = 16;
        geo.start = 4;
        if (copy_to_user((void __user *) arg, &geo, sizeof(geo)))
            return -EFAULT;
        return 0;
    }

    return -ENOTTY; /* unknown command */
}
int sbull_ioctl (struct inode *inode, struct file *filp,
                 unsigned int cmd, unsigned long arg)
{
    long size;
    struct hd_geometry geo;
    struct sbull_dev *dev = filp->private_data;

    switch(cmd) {
        case HDIO_GETGEO:
        /*
         * Get geometry: since we are a virtual device, we have to make
         * up something plausible.  So we claim 16 sectors, four heads,
         * and calculate the corresponding number of cylinders.  We set the
         * start of data at sector four.
         */
        size = dev->size*(hardsect_size/KERNEL_SECTOR_SIZE);
        geo.cylinders = (size & ~0x3f) >> 6;
        geo.heads = 4;
        geo.sectors = 16;
        geo.start = 4;
        if (copy_to_user((void __user *) arg, &geo, sizeof(geo)))
            return -EFAULT;
        return 0;
    }

    return -ENOTTY; /* unknown command */
}

提供几何信息似乎是一项奇怪的任务,因为我们的设备是纯虚拟的,与磁道和柱面毫无关系。即使是真实的块硬件,多年来其内部结构也早已复杂得多。内核并不关心块设备的几何结构;它只是将其视为扇区的线性阵列。然而,某些用户空间实用程序仍然希望能够查询磁盘的几何参数。特别是,编辑分区表的fdisk工具依赖于柱面信息,如果该信息不可用,则无法正常工作。

Providing geometry information may seem like a curious task, since our device is purely virtual and has nothing to do with tracks and cylinders. Even most real-block hardware has been furnished with much more complicated structures for many years. The kernel is not concerned with a block device's geometry; it sees it simply as a linear array of sectors. There are certain user-space utilities that still expect to be able to query a disk's geometry, however. In particular, the fdisk tool, which edits partition tables, depends on cylinder information and does not function properly if that information is not available.

我们希望sbull设备是可分区的,即使使用旧式的、简单的工具也是如此。因此,我们提供了一个ioctl方法,它为能够匹配我们设备容量的几何参数编造出可信的虚构值。大多数磁盘驱动程序都会做类似的事情。请注意,像往常一样,如果需要,扇区计数会被换算,以匹配内核所使用的 512 字节约定。

We would like the sbull device to be partitionable, even with older, simple-minded tools. So, we have provided an ioctl method that comes up with a credible fiction for a geometry that could match the capacity of our device. Most disk drivers do something similar. Note that, as usual, the sector count is translated, if need be, to match the 512-byte convention used by the kernel.
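
上面编造的几何参数可以独立验证:柱面数、磁头数与每磁道扇区数的乘积应(在向下对齐后)等于设备容量。以下用户空间草图重现了这一计算(fake_geo和sbull_fake_geometry是为演示而虚构的名字):

The fabricated geometry can be checked independently: cylinders times heads times sectors per track should (after rounding down) equal the device's capacity. This userspace sketch reproduces the calculation (fake_geo and sbull_fake_geometry are invented names for illustration):

```c
/* The fictional geometry used above: 16 sectors/track and 4 heads give
 * 64 sectors per cylinder, hence the mask and shift by 6. */
struct fake_geo {
    unsigned long cylinders;
    int heads, sectors, start;
};

static struct fake_geo sbull_fake_geometry(unsigned long size_in_sectors)
{
    struct fake_geo g;

    g.cylinders = (size_in_sectors & ~0x3fUL) >> 6;
    g.heads = 4;
    g.sectors = 16;
    g.start = 4;
    return g;
}
```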

请求处理

Request Processing

每个块驱动程序的核心是它的request函数。这个函数是真正完成工作的地方,或者至少是工作开始的地方;其余的都是开销。因此,我们会花相当多的时间来研究块驱动程序中的请求处理。

The core of every block driver is its request function. This function is where the real work gets done—or at least started; all the rest is overhead. Consequently, we spend a fair amount of time looking at request processing in block drivers.

磁盘驱动器的性能可能是整个系统性能的关键部分。因此,内核的块子系统在编写时非常考虑性能;它尽一切可能使您的驱动程序能够充分利用其控制的设备。这是一件好事,因为它可以实现极快的 I/O。另一方面,块子系统不必要地暴露了驱动程序 API 的大量复杂性。可以编写一个非常简单的请求函数(我们很快就会看到一个),但是如果您的驱动程序必须在复杂的硬件上以高级别执行,那么它就绝非简单。

A disk driver's performance can be a critical part of the performance of the system as a whole. Therefore, the kernel's block subsystem has been written with performance very much in mind; it does everything possible to enable your driver to get the most out of the devices it controls. This is a good thing, in that it enables blindingly fast I/O. On the other hand, the block subsystem unnecessarily exposes a great deal of complexity in the driver API. It is possible to write a very simple request function (we will see one shortly), but if your driver must perform at a high level on complex hardware, it will be anything but simple.

请求方法介绍

Introduction to the request Method

块驱动程序请求方法具有以下原型:

The block driver request method has the following prototype:

void request(request_queue_t *queue);
void request(request_queue_t *queue);

每当内核认为是时候让驱动程序处理设备上的某些读取、写入或其他操作时,就会调用此函数。request函数在返回之前不需要实际完成队列中的所有请求;事实上,对于大多数真实设备来说,它可能一个也不会完成。然而,它必须启动这些请求,并确保它们最终都由驱动程序处理。

This function is called whenever the kernel believes it is time for your driver to process some reads, writes, or other operations on the device. The request function does not need to actually complete all of the requests on the queue before it returns; indeed, it probably does not complete any of them for most real devices. It must, however, make a start on those requests and ensure that they are all, eventually, processed by the driver.

每个设备都有一个请求队列。这是因为与磁盘之间的实际传输可能发生在内核请求它们很久之后,并且内核需要灵活地在最有利的时刻调度每次传输(例如,将影响磁盘上相邻扇区的请求分组在一起)。您可能还记得,request函数在请求队列创建时与该队列相关联。让我们回顾一下sbull是如何创建其队列的:

Every device has a request queue. This is because actual transfers to and from a disk can take place far away from the time the kernel requests them, and because the kernel needs the flexibility to schedule each transfer at the most propitious moment (grouping together, for instance, requests that affect sectors close together on the disk). And the request function, you may remember, is associated with a request queue when that queue is created. Let us look back at how sbull makes its queue:

dev->queue = blk_init_queue(sbull_request, &dev->lock);
dev->queue = blk_init_queue(sbull_request, &dev->lock);

因此,当队列创建时,request函数就与其关联。我们还提供了一个自旋锁作为队列创建过程的一部分。每当我们的request函数被调用时,该锁就由内核持有。因此,request函数在原子上下文中运行;它必须遵循第 5 章中讨论的原子代码的所有常用规则。

Thus, when the queue is created, the request function is associated with it. We also provided a spinlock as part of the queue creation process. Whenever our request function is called, that lock is held by the kernel. As a result, the request function is running in an atomic context; it must follow all of the usual rules for atomic code discussed in Chapter 5.

当您的请求函数持有锁时,队列锁还可以防止内核对您的设备的任何其他请求进行排队。在某些情况下,您可能需要考虑在 请求函数运行时删除该锁。但是,如果这样做,则必须确保在未持有锁时不要访问请求队列或受锁保护的任何其他数据结构。您还必须在请求函数返回之前重新获取锁 。

The queue lock also prevents the kernel from queuing any other requests for your device while your request function holds the lock. Under some conditions, you may want to consider dropping that lock while the request function runs. If you do so, however, you must be sure not to access the request queue, or any other data structure protected by the lock, while the lock is not held. You must also reacquire the lock before the request function returns.

最后,请求函数的调用(通常)与任何用户空间进程的操作完全异步。您不能假设内核正在启动当前请求的进程的上下文中运行。您不知道请求提供的 I/O 缓冲区是在内核空间还是用户空间。因此,任何类型的显式访问用户空间的操作都是错误的,并且肯定会导致麻烦。正如您将看到的,驱动程序需要了解的有关请求的所有信息都包含在通过请求队列传递给您的结构中。

Finally, the invocation of the request function is (usually) entirely asynchronous with respect to the actions of any user-space process. You cannot assume that the kernel is running in the context of the process that initiated the current request. You do not know if the I/O buffer provided by the request is in kernel or user space. So any sort of operation that explicitly accesses user space is in error and will certainly lead to trouble. As you will see, everything your driver needs to know about the request is contained within the structures passed to you via the request queue.

一个简单的请求方法

A Simple request Method

sbull示例驱动程序 提供了几种不同的请求处理方法。默认情况下,sbull使用名为sbull_request的方法 ,该方法是最简单的请求方法的示例。话不多说,这里是:

The sbull example driver provides a few different methods for request processing. By default, sbull uses a method called sbull_request, which is meant to be an example of the simplest possible request method. Without further ado, here it is:

static void sbull_request(request_queue_t *q)
{
    struct request *req;

    while ((req = elv_next_request(q)) != NULL) {
        struct sbull_dev *dev = req->rq_disk->private_data;
        if (!blk_fs_request(req)) {
            printk (KERN_NOTICE "Skip non-fs request\n");
            end_request(req, 0);
            continue;
        }
        sbull_transfer(dev, req->sector, req->current_nr_sectors,
                req->buffer, rq_data_dir(req));
        end_request(req, 1);
    }
}
static void sbull_request(request_queue_t *q)
{
    struct request *req;

    while ((req = elv_next_request(q)) != NULL) {
        struct sbull_dev *dev = req->rq_disk->private_data;
        if (! blk_fs_request(req)) {
            printk (KERN_NOTICE "Skip non-fs request\n");
            end_request(req, 0);
            continue;
        }
        sbull_transfer(dev, req->sector, req->current_nr_sectors,
                req->buffer, rq_data_dir(req));
        end_request(req, 1);
    }
}

该函数引入了struct request结构。稍后我们将详细研究struct request;现在,只需知道它代表了一个供我们执行的块 I/O 请求就足够了。

This function introduces the struct request structure. We will examine struct request in great detail later on; for now, suffice it to say that it represents a block I/O request for us to execute.

内核提供了elv_next_request函数来获取队列中第一个未完成的请求;当没有要处理的请求时,该函数返回NULL。请注意,elv_next_request不会将请求从队列中删除。如果您连续调用它两次而中间没有其他操作,则两次都会返回相同的request结构。在这种简单的操作模式中,请求仅在完成时才会从队列中取出。

The kernel provides the function elv_next_request to obtain the first incomplete request on the queue; that function returns NULL when there are no requests to be processed. Note that elv_next_request does not remove the request from the queue. If you call it twice with no intervening operations, it returns the same request structure both times. In this simple mode of operation, requests are taken off the queue only when they are complete.
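
这种“只查看队首、完成时才出队”的语义可以用一个用户空间模型来说明(仅为示意,并非内核 API):

The "peek at the head, dequeue only on completion" semantics can be illustrated with a userspace model (for illustration only, not the kernel API):

```c
#include <stddef.h>

/* Userspace model of the simple mode of operation described above. */
struct fake_request {
    struct fake_request *next;
    int done;
};

struct fake_queue {
    struct fake_request *head;
};

/* Like elv_next_request: returns the head without dequeuing it. */
static struct fake_request *fake_next_request(struct fake_queue *q)
{
    return q->head;
}

/* Only ending a request takes it off the queue. */
static void fake_end_request(struct fake_queue *q, struct fake_request *req)
{
    req->done = 1;
    q->head = req->next;
}
```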

块请求队列可以包含实际上并不在磁盘之间移动数据块的请求。此类请求可以包括厂商特定的低级诊断操作,或与专用设备模式相关的指令,例如可记录介质的数据包写入模式。大多数块驱动程序不知道如何处理此类请求,只会让它们失败;sbull也是这样做的。对blk_fs_request的调用告诉我们当前请求是否是文件系统请求,即移动数据块的请求。如果请求不是文件系统请求,我们将其传递给end_request:

A block request queue can contain requests that do not actually move blocks to and from a disk. Such requests can include vendor-specific, low-level diagnostics operations or instructions relating to specialized device modes, such as the packet writing mode for recordable media. Most block drivers do not know how to handle such requests and simply fail them; sbull works in this way as well. The call to blk_fs_request tells us whether we are looking at a filesystem request—one that moves blocks of data. If a request is not a filesystem request, we pass it to end_request:

void end_request(struct request *req, int succeeded);
void end_request(struct request *req, int succeeded);

当我们处理非文件系统请求时,我们将succeeded作为0传递,以表明我们没有成功完成该请求。否则,我们调用sbull_transfer来实际移动数据,它使用request结构中提供的一组字段:

When we dispose of nonfilesystem requests, we pass succeeded as 0 to indicate that we did not successfully complete the request. Otherwise, we call sbull_transfer to actually move the data, using a set of fields provided in the request structure:

sector_t sector;
sector_t sector;

我们设备上起始扇区的索引。请记住,该扇区号与所有在内核和驱动程序之间传递的此类数字一样,以 512 字节扇区表示。如果您的硬件使用不同的扇区大小,则需要相应地对sector进行换算。例如,如果硬件使用 2048 字节扇区,则需要将起始扇区号除以 4,然后再将其放入对硬件的请求中。

The index of the beginning sector on our device. Remember that this sector number, like all such numbers passed between the kernel and the driver, is expressed in 512-byte sectors. If your hardware uses a different sector size, you need to scale sector accordingly. For example, if the hardware uses 2048-byte sectors, you need to divide the beginning sector number by four before putting it into a request for the hardware.

unsigned long nr_sectors;
unsigned long nr_sectors;

要传输的扇区数(512 字节)。

The number of (512-byte) sectors to be transferred.

char *buffer;
char *buffer;

指向应向其传输数据或从中传输数据的缓冲区的指针。该指针是内核虚拟地址,如果需要,驱动程序可以直接取消引用。

A pointer to the buffer to or from which the data should be transferred. This pointer is a kernel virtual address and can be dereferenced directly by the driver if need be.

rq_data_dir(struct request *req);
rq_data_dir(struct request *req);

该宏从请求中提取传输方向;零返回值表示从设备读取,非零返回值表示对设备写入。

This macro extracts the direction of the transfer from the request; a zero return value denotes a read from the device, and a nonzero return value denotes a write to the device.

有了这些信息,sbull驱动程序就可以通过简单的memcpy调用来实现实际的数据传输——毕竟我们的数据已经在内存中了。执行此复制操作的函数 ( sbull_transfer ) 还处理扇区大小的缩放,并确保我们不会尝试复制超出虚拟设备的末尾:

Given this information, the sbull driver can implement the actual data transfer with a simple memcpy call—our data is already in memory, after all. The function that performs this copy operation (sbull_transfer) also handles the scaling of sector sizes and ensures that we do not try to copy beyond the end of our virtual device:

static void sbull_transfer(struct sbull_dev *dev, unsigned long sector,
        unsigned long nsect, char *buffer, int write)
{
    unsigned long offset = sector*KERNEL_SECTOR_SIZE;
    unsigned long nbytes = nsect*KERNEL_SECTOR_SIZE;

    if ((offset + nbytes) > dev->size) {
        printk (KERN_NOTICE "Beyond-end write (%ld %ld)\n", offset, nbytes);
        return;
    }
    if (write)
        memcpy(dev->data + offset, buffer, nbytes);
    else
        memcpy(buffer, dev->data + offset, nbytes);
}
static void sbull_transfer(struct sbull_dev *dev, unsigned long sector,
        unsigned long nsect, char *buffer, int write)
{
    unsigned long offset = sector*KERNEL_SECTOR_SIZE;
    unsigned long nbytes = nsect*KERNEL_SECTOR_SIZE;

    if ((offset + nbytes) > dev->size) {
        printk (KERN_NOTICE "Beyond-end write (%ld %ld)\n", offset, nbytes);
        return;
    }
    if (write)
        memcpy(dev->data + offset, buffer, nbytes);
    else
        memcpy(buffer, dev->data + offset, nbytes);
}

通过代码,sbull实现了一个完整、简单的基于 RAM 的磁盘设备。然而,由于多种原因,它对于许多类型的设备来说并不是一个现实的驱动程序。

With the code, sbull implements a complete, simple RAM-based disk device. It is not, however, a realistic driver for many types of devices, for a couple of reasons.

第一个原因是sbull同步地、一次一个地执行请求。高性能磁盘设备能够同时处理大量未完成的请求;磁盘的板载控制器便可以选择以(希望是)最佳的顺序执行它们。只要我们只处理队列中的第一个请求,就永远不可能在同一时间有多个请求得到满足。能够同时处理多个请求,需要对请求队列和request结构有更深入的了解;接下来的几节将帮助建立这种理解。

The first of those reasons is that sbull executes requests synchronously, one at a time. High-performance disk devices are capable of having numerous requests outstanding at the same time; the disk's onboard controller can then choose to execute them in the optimal order (one hopes). As long as we process only the first request in the queue, we can never have multiple requests being fulfilled at a given time. Being able to work with more than one request requires a deeper understanding of request queues and the request structure; the next few sections help build that understanding.

然而,还有另一个问题需要考虑。当系统执行涉及磁盘上多个扇区的大型传输时,磁盘设备可以获得最佳性能。磁盘操作中成本最高的始终是读写磁头的定位;一旦完成,实际读取或写入数据所需的时间几乎可以忽略不计。设计和实现文件系统和虚拟内存子系统的开发人员了解这一点,因此他们尽最大努力在磁盘上连续定位相关数据,并在单个请求中传输尽可能多的扇区。块子系统在这方面也有帮助;请求队列包含大量逻辑,旨在查找相邻请求并将它们合并为更大的操作。

There is another issue to consider, however. The best performance is obtained from disk devices when the system performs large transfers involving multiple sectors that are located together on the disk. The highest cost in a disk operation is always the positioning of the read and write heads; once that is done, the time required to actually read or write the data is almost insignificant. The developers who design and implement filesystems and virtual memory subsystems understand this, so they do their best to locate related data contiguously on the disk and to transfer as many sectors as possible in a single request. The block subsystem also helps in this regard; request queues contain a great deal of logic aimed at finding adjacent requests and coalescing them into larger operations.

然而,sbull 驱动程序对所有这些工作却置之不理。它一次仅传输一个缓冲区,这意味着最大的单次传输几乎永远不会超过单个页面的大小。块驱动程序可以做得更好,但这需要对 request 结构以及构建请求所用的 bio 结构有更深入的了解。

The sbull driver, however, takes all that work and simply ignores it. Only one buffer is transferred at a time, meaning that the largest single transfer is almost never going to exceed the size of a single page. A block driver can do much better than that, but it requires a deeper understanding of request structures and the bio structures from which requests are built.

接下来的几节将更深入地探讨块层如何完成其工作,以及这些工作所产生的数据结构。

The next few sections delve more deeply into how the block layer does its job and the data structures that result from that work.

请求队列

Request Queues

从最简单的意义上来说,块请求队列正如其名:块 I/O 请求的队列。如果深入观察,就会发现请求队列是一个出人意料地复杂的数据结构。幸运的是,驱动程序不必担心其中的大部分复杂性。

In the simplest sense, a block request queue is exactly that: a queue of block I/O requests. If you look under the hood, a request queue turns out to be a surprisingly complex data structure. Fortunately, drivers need not worry about most of that complexity.

请求队列跟踪未完成的块 I/O 请求。但它们在创建这些请求的过程中也发挥着至关重要的作用。请求队列存储描述设备能够服务的请求类型的参数:它们的最大大小、请求中可以包含多少个单独的段、硬件扇区大小、对齐要求等。如果您的请求队列配置正确,它永远不应该向您提出您的设备无法处理的请求。

Request queues keep track of outstanding block I/O requests. But they also play a crucial role in the creation of those requests. The request queue stores parameters that describe what kinds of requests the device is able to service: their maximum size, how many separate segments may go into a request, the hardware sector size, alignment requirements, etc. If your request queue is properly configured, it should never present you with a request that your device cannot handle.

请求队列还实现了一个插件接口,允许使用多个 I/O 调度程序(或称电梯)。I/O 调度程序的工作是以最大化性能的方式向驱动程序提交 I/O 请求。为此,大多数 I/O 调度程序会累积一批请求,将它们按递增(或递减)的块索引顺序排序,并按该顺序将请求呈现给驱动程序。当得到一个排好序的请求列表时,磁盘头会从磁盘的一端移动到另一端,就像满载的电梯沿一个方向移动,直到所有"请求"(等待下车的人)都得到满足为止。2.6 内核包含一个"最后期限调度程序",它努力确保每个请求在预设的最大时间内得到满足;还有一个"预期调度程序",它会在读取请求之后让设备短暂停顿,以等待另一个相邻的读取几乎立即到达。截至撰写本文时,默认调度程序是预期调度程序,它似乎能提供最佳的交互式系统性能。

Request queues also implement a plug-in interface that allows multiple I/O schedulers (or elevators) to be used. An I/O scheduler's job is to present I/O requests to your driver in a way that maximizes performance. To this end, most I/O schedulers accumulate a batch of requests, sort them into increasing (or decreasing) block index order, and present the requests to the driver in that order. The disk head, when given a sorted list of requests, works its way from one end of the disk to the other, much like a full elevator moves in a single direction until all of its "requests" (people waiting to get off) have been satisfied. The 2.6 kernel includes a "deadline scheduler," which makes an effort to ensure that every request is satisfied within a preset maximum time, and an "anticipatory scheduler," which actually stalls a device briefly after a read request in anticipation that another, adjacent read will arrive almost immediately. As of this writing, the default scheduler is the anticipatory scheduler, which seems to give the best interactive system performance.
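The elevator's core sorting step can be pictured in plain user-space C. This is our own toy representation (a "request" reduced to its starting sector), not kernel code: a batch of requests is sorted into increasing block-index order before being handed to the driver.

```c
#include <assert.h>
#include <stdlib.h>

/* A toy "request": just a starting sector number. */
struct toy_req { unsigned long sector; };

static int cmp_sector(const void *a, const void *b)
{
    const struct toy_req *ra = a, *rb = b;
    if (ra->sector < rb->sector) return -1;
    if (ra->sector > rb->sector) return 1;
    return 0;
}

/* Sort a batch into increasing block-index order, as an elevator would
 * before presenting the requests to the driver. */
static void elevator_sort(struct toy_req *reqs, size_t n)
{
    qsort(reqs, n, sizeof(*reqs), cmp_sector);
}
```

With the batch in sector order, the drive head sweeps in one direction instead of seeking back and forth, which is the entire point of the elevator analogy.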

I/O 调度程序还负责合并相邻的请求。当新的 I/O 请求交给调度程序时,它会在队列中搜索涉及相邻扇区的请求;如果找到一个并且生成的请求不会太大,则合并这两个请求。

The I/O scheduler is also charged with merging adjacent requests. When a new I/O request is handed to the scheduler, it searches the queue for requests involving adjacent sectors; if one is found and if the resulting request would not be too large, the two requests are merged.
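The merge test the scheduler applies can be sketched with a simplified request representation of our own (the struct and function names here are illustrative, not the kernel's): a new request can be appended to an existing one only if it starts exactly where the other ends, moves data in the same direction, and the combined size stays within the queue's limit.

```c
#include <assert.h>

struct toy_req2 {
    unsigned long sector;   /* first sector */
    unsigned long nsect;    /* number of sectors */
    int write;              /* nonzero for a write */
};

/* Return nonzero if b can be appended to a (a "back merge"), given a
 * maximum request size in sectors. Illustrative only. */
static int can_back_merge(const struct toy_req2 *a, const struct toy_req2 *b,
                          unsigned long max_sectors)
{
    if (a->write != b->write)
        return 0;                       /* never mix reads and writes */
    if (a->sector + a->nsect != b->sector)
        return 0;                       /* not adjacent on the device */
    return a->nsect + b->nsect <= max_sectors;
}
```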

请求队列的类型为 struct request_queue 或 request_queue_t。这种类型以及对其进行操作的许多函数定义在 <linux/blkdev.h> 中。如果您对请求队列的实现感兴趣,可以在 drivers/block/ll_rw_block.c 和 elevator.c 中找到大部分代码。

Request queues have a type of struct request_queue or request_queue_t. This type, and the many functions that operate on it, are defined in <linux/blkdev.h>. If you are interested in the implementation of request queues, you can find most of the code in drivers/block/ll_rw_block.c and elevator.c.

队列的创建和删除

Queue creation and deletion

正如我们在示例代码中看到的,请求队列是一个必须由块 I/O 子系统创建的动态数据结构。创建和初始化请求队列的函数是:

As we saw in our example code, a request queue is a dynamic data structure that must be created by the block I/O subsystem. The function to create and initialize a request queue is:

request_queue_t *blk_init_queue(request_fn_proc *request, spinlock_t *lock);
request_queue_t *blk_init_queue(request_fn_proc *request, spinlock_t *lock);

当然,参数是该队列的请求函数和控制对该队列的访问的自旋锁。该函数分配内存(实际上是相当多的内存)并且可能因此失败;在尝试使用队列之前,您应该始终检查返回值。

The arguments are, of course, the request function for this queue and a spinlock that controls access to the queue. This function allocates memory (quite a bit of memory, actually) and can fail because of this; you should always check the return value before attempting to use the queue.

作为请求队列初始化的一部分,您可以将 queuedata 字段(它是一个 void * 指针)设置为您喜欢的任何值。该字段相当于我们在其他结构中看到的 private_data 字段。

As part of the initialization of a request queue, you can set the field queuedata (which is a void * pointer) to any value you like. This field is the request queue's equivalent to the private_data we have seen in other structures.

要将请求队列返回到系统(通常在模块卸载时),请调用 blk_cleanup_queue

To return a request queue to the system (at module unload time, generally), call blk_cleanup_queue :

void blk_cleanup_queue(request_queue_t *);
void blk_cleanup_queue(request_queue_t *);

在此调用之后,您的驱动程序不会再看到来自给定队列的请求,并且不应再次引用它。

After this call, your driver sees no more requests from the given queue and should not reference it again.

排队功能

Queueing functions

就驱动程序而言,用于操纵队列上请求的函数集合非常小。在调用这些函数之前,您必须持有队列锁。

There is a very small set of functions for the manipulation of requests on queues—at least, as far as drivers are concerned. You must hold the queue lock before you call these functions.

返回下一个要处理的请求的函数是 elv_next_request

The function that returns the next request to process is elv_next_request :

struct request *elv_next_request(request_queue_t *queue);
struct request *elv_next_request(request_queue_t *queue);

我们已经在简单的 sbull 示例中看到了这个函数。它返回一个指向下一个要处理的请求(由 I/O 调度程序确定)的指针;如果没有更多请求需要处理,则返回 NULL。elv_next_request 将请求留在队列中,但将其标记为活动状态;一旦您开始执行该请求,这个标记就会阻止 I/O 调度程序尝试将其他请求与之合并。

We have already seen this function in the simple sbull example. It returns a pointer to the next request to process (as determined by the I/O scheduler) or NULL if no more requests remain to be processed. elv_next_request leaves the request on the queue but marks it as being active; this mark prevents the I/O scheduler from attempting to merge other requests with this one once you start to execute it.

要实际从队列中删除请求,请使用 blkdev_dequeue_request

To actually remove a request from a queue, use blkdev_dequeue_request :

void blkdev_dequeue_request(struct request *req);
void blkdev_dequeue_request(struct request *req);

如果您的驱动程序同时对同一队列中的多个请求进行操作,则必须以这种方式将它们出队。

If your driver operates on multiple requests from the same queue simultaneously, it must dequeue them in this manner.

如果出于某种原因需要将出队的请求放回到队列中,您可以调用:

Should you need to put a dequeued request back on the queue for some reason, you can call:

void elv_requeue_request(request_queue_t *queue, struct request *req);
void elv_requeue_request(request_queue_t *queue, struct request *req);

队列控制功能

Queue control functions

块层导出一组函数,驱动程序可以用它们来控制请求队列的操作方式。这些函数包括:

The block layer exports a set of functions that can be used by a driver to control how a request queue operates. These functions include:

void blk_stop_queue(request_queue_t *queue);

void blk_start_queue(request_queue_t *queue);
void blk_stop_queue(request_queue_t *queue);

void blk_start_queue(request_queue_t *queue);

如果您的设备已达到无法处理更多未完成命令的状态,您可以调用blk_stop_queue来告诉块层。在此调用之后,您的请求函数将不会被调用,直到您调用blk_start_queue。不用说,当您的设备可以处理更多请求时,您不应该忘记重新启动队列。调用这些函数中的任何一个时都必须保持队列锁。

If your device has reached a state where it can handle no more outstanding commands, you can call blk_stop_queue to tell the block layer. After this call, your request function will not be called until you call blk_start_queue. Needless to say, you should not forget to restart the queue when your device can handle more requests. The queue lock must be held when calling either of these functions.

void blk_queue_bounce_limit(request_queue_t *queue, u64 dma_addr);
void blk_queue_bounce_limit(request_queue_t *queue, u64 dma_addr);

告诉内核设备可以执行 DMA 的最高物理地址的函数。如果传入的请求包含对超出此限制的内存的引用,则将使用反弹缓冲区来执行该操作;当然,这是一种执行块 I/O 的昂贵方式,应尽可能避免。您可以在此参数中提供任何合理的物理地址,或使用预定义的符号 BLK_BOUNCE_HIGH(为高内存页面使用反弹缓冲区)、BLK_BOUNCE_ISA(驱动程序只能 DMA 到 16 MB 的 ISA 区域)或 BLK_BOUNCE_ANY(驱动程序可以对任何地址执行 DMA)。默认值为 BLK_BOUNCE_HIGH。

Function that tells the kernel the highest physical address to which your device can perform DMA. If a request comes in containing a reference to memory above the limit, a bounce buffer will be used for the operation; this is, of course, an expensive way to perform block I/O and should be avoided whenever possible. You can provide any reasonable physical address in this argument, or make use of the predefined symbols BLK_BOUNCE_HIGH (use bounce buffers for high-memory pages), BLK_BOUNCE_ISA (the driver can DMA only into the 16-MB ISA zone), or BLK_BOUNCE_ANY (the driver can perform DMA to any address). The default value is BLK_BOUNCE_HIGH.
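The decision the block layer makes here reduces to a simple predicate, sketched below in user-space C of our own (not kernel code): any buffer whose physical address range extends above the device's DMA limit needs a bounce buffer.

```c
#include <assert.h>
#include <stdint.h>

/* Return nonzero if a buffer at physical address addr, of len bytes,
 * lies in whole or in part above the device's DMA reach, so that a
 * bounce buffer would be required. Illustrative only. */
static int needs_bounce(uint64_t addr, uint64_t len, uint64_t dma_limit)
{
    return addr + len - 1 > dma_limit;
}
```

With the 16-MB ISA limit (0x00ffffff), a small buffer ending below the limit passes, while one straddling or starting above the boundary would be bounced.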

void blk_queue_max_sectors(request_queue_t *queue, unsigned short max);

void blk_queue_max_phys_segments(request_queue_t *queue, unsigned short max);

void blk_queue_max_hw_segments(request_queue_t *queue, unsigned short max);

void blk_queue_max_segment_size(request_queue_t *queue, unsigned int max);
void blk_queue_max_sectors(request_queue_t *queue, unsigned short max);

void blk_queue_max_phys_segments(request_queue_t *queue, unsigned short max);

void blk_queue_max_hw_segments(request_queue_t *queue, unsigned short max);

void blk_queue_max_segment_size(request_queue_t *queue, unsigned int max);

设置描述该设备所能满足的请求的参数的函数。blk_queue_max_sectors 可用于设置任何请求的最大大小,以(512 字节)扇区为单位;默认值为 255。blk_queue_max_phys_segments 和 blk_queue_max_hw_segments 都控制单个请求中可以包含多少个物理段(系统内存中不相邻的区域)。使用 blk_queue_max_phys_segments 来说明您的驱动程序准备处理多少个段;例如,这可以是静态分配的散布列表(scatterlist)的大小。相反,blk_queue_max_hw_segments 是设备本身可以处理的最大段数。这两个参数都默认为 128。最后,blk_queue_max_segment_size 告诉内核请求的任何单个段可以有多大(以字节为单位);默认值为 65,536 字节。

Functions that set parameters describing the requests that can be satisfied by this device. blk_queue_max_sectors can be used to set the maximum size of any request in (512-byte) sectors; the default is 255. blk_queue_max_phys_segments and blk_queue_max_hw_segments both control how many physical segments (nonadjacent areas in system memory) may be contained within a single request. Use blk_queue_max_phys_segments to say how many segments your driver is prepared to cope with; this may be the size of a statically allocated scatterlist, for example. blk_queue_max_hw_segments, in contrast, is the maximum number of segments that the device itself can handle. Both of these parameters default to 128. Finally, blk_queue_max_segment_size tells the kernel how large any individual segment of a request can be in bytes; the default is 65,536 bytes.

blk_queue_segment_boundary(request_queue_t *queue, unsigned long mask);
blk_queue_segment_boundary(request_queue_t *queue, unsigned long mask);

某些设备无法处理跨越特定大小内存边界的请求;如果您的设备是其中之一,请使用此函数将该边界告知内核。例如,如果您的设备在处理跨越 4 MB 边界的请求时遇到问题,请传入掩码 0x3fffff。默认掩码是 0xffffffff。

Some devices cannot handle requests that cross a particular size memory boundary; if your device is one of those, use this function to tell the kernel about that boundary. For example, if your device has trouble with requests that cross a 4-MB boundary, pass in a mask of 0x3fffff. The default mask is 0xffffffff.
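The mask convention is easy to get wrong, so here is a small sketch of our own showing how a mask like 0x3fffff detects a 4-MB boundary crossing: the segment crosses the boundary exactly when the high bits (those outside the mask) differ between its first and last byte.

```c
#include <assert.h>
#include <stdint.h>

/* Return nonzero if [addr, addr+len) crosses a boundary described by
 * mask (e.g., 0x3fffff for 4 MB): the bits above the mask differ
 * between the first and last byte of the segment. */
static int crosses_boundary(uint64_t addr, uint64_t len, uint64_t mask)
{
    return (addr & ~mask) != ((addr + len - 1) & ~mask);
}
```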

void blk_queue_dma_alignment(request_queue_t *queue, int mask);
void blk_queue_dma_alignment(request_queue_t *queue, int mask);

告诉内核您的设备对 DMA 传输施加的内存对齐约束的函数。所有请求都是使用给定的对齐方式创建的,并且请求的长度也与该对齐方式匹配。默认掩码为0x1ff,这会导致所有请求在 512 字节边界上对齐。

Function that tells the kernel about the memory alignment constraints your device imposes on DMA transfers. All requests are created with the given alignment, and the length of the request also matches the alignment. The default mask is 0x1ff, which causes all requests to be aligned on 512-byte boundaries.

void blk_queue_hardsect_size(request_queue_t *queue, unsigned short max);
void blk_queue_hardsect_size(request_queue_t *queue, unsigned short max);

告诉内核您设备的硬件扇区大小。内核生成的所有请求都是该大小的倍数并且正确对齐。然而,块层和驱动程序之间的所有通信仍然以 512 字节扇区表示。

Tells the kernel about your device's hardware sector size. All requests generated by the kernel are a multiple of this size and are properly aligned. All communications between the block layer and the driver continues to be expressed in 512-byte sectors, however.

请求的剖析

The Anatomy of a Request

在我们的简单示例中,我们遇到了该request 结构。然而,我们仅仅触及了这种复杂数据结构的表面。在本节中,我们将详细了解 Linux 内核中块 I/O 请求的表示方式。

In our simple example, we encountered the request structure. However, we have barely scratched the surface of that complicated data structure. In this section, we look, in some detail, at how block I/O requests are represented in the Linux kernel.

每个request结构代表一个块 I/O 请求,尽管它可能是通过在更高级别上合并多个独立请求而形成的。对于任何特定请求要传输的扇区可以分布在整个主存储器中,尽管它们总是对应于块设备上的一组连续扇区。该请求被表示为一组段,每个段对应一个内存缓冲区。内核可能会合并涉及磁盘上相邻扇区的多个请求,但它绝不会在单个请求中组合读取和写入操作。request结构。如果结果违反上一节中描述的任何请求队列限制,内核还会确保不合并请求。

Each request structure represents one block I/O request, although it may have been formed through a merger of several independent requests at a higher level. The sectors to be transferred for any particular request may be distributed throughout main memory, although they always correspond to a set of consecutive sectors on the block device. The request is represented as a set of segments, each of which corresponds to one in-memory buffer. The kernel may join multiple requests that involve adjacent sectors on the disk, but it never combines read and write operations within a single request structure. The kernel also makes sure not to combine requests if the result would violate any of the request queue limits described in the previous section.

结构request本质上是作为bio结构的链接列表与一些内务信息相结合来实现的,以使驱动程序能够在处理请求时跟踪其位置。该bio结构是块 I/O 请求的一部分的低级描述;我们现在来看看。

A request structure is implemented, essentially, as a linked list of bio structures combined with some housekeeping information to enable the driver to keep track of its position as it works through the request. The bio structure is a low-level description of a portion of a block I/O request; we take a look at it now.

bio 结构

The bio structure

当内核(以文件系统、虚拟内存子系统或系统调用的形式)决定必须将一组块传输到块 I/O 设备或从中传出时,它会组装一个 bio 结构来描述该操作。然后将该结构交给块 I/O 代码,块 I/O 代码将其合并到现有的 request 结构中,或者在需要时创建一个新结构。bio 结构包含块驱动程序执行请求所需的所有内容,而无需引用导致该请求被发起的用户空间进程。

When the kernel, in the form of a filesystem, the virtual memory subsystem, or a system call, decides that a set of blocks must be transferred to or from a block I/O device, it puts together a bio structure to describe that operation. That structure is then handed to the block I/O code, which merges it into an existing request structure or, if need be, creates a new one. The bio structure contains everything that a block driver needs to carry out the request without reference to the user-space process that caused that request to be initiated.

bio 结构体在<linux/bio.h>中定义,包含许多可能对驱动程序作者有用的字段:

The bio structure, which is defined in <linux/bio.h>, contains a number of fields that may be of use to driver authors:

sector_t bi_sector;
sector_t bi_sector;

该 bio 要传输的第一个(512 字节)扇区。

The first (512-byte) sector to be transferred for this bio.

unsigned int bi_size;
unsigned int bi_size;

要传输的数据大小(以字节为单位)。通常使用 bio_sectors(bio) 更方便,这是一个以扇区为单位给出大小的宏。

The size of the data to be transferred, in bytes. It is often easier to use bio_sectors(bio), a macro that gives the size in sectors.

unsigned long bi_flags;
unsigned long bi_flags;

一组描述该 bio 的标志;如果这是一个写请求,则设置最低有效位(不过应该使用 bio_data_dir(bio) 宏,而不是直接查看这些标志)。

A set of flags describing the bio; the least significant bit is set if this is a write request (although the macro bio_data_dir(bio) should be used instead of looking at the flags directly).

unsigned short bi_phys_segments;

unsigned short bi_hw_segments;
unsigned short bi_phys_segments;

unsigned short bi_hw_segments;

分别是该 BIO 中包含的物理段数和 DMA 映射完成后硬件看到的段数。

The number of physical segments contained within this BIO and the number of segments seen by the hardware after DMA mapping is done, respectively.

然而,bio 的核心是一个名为 bi_io_vec 的数组,它由以下结构组成:

The core of a bio, however, is an array called bi_io_vec , which is made up of the following structure:

struct bio_vec {
        struct page     *bv_page;
        unsigned int    bv_len;
        unsigned int    bv_offset;
};
struct bio_vec {
        struct page     *bv_page;
        unsigned int    bv_len;
        unsigned int    bv_offset;
};

图 16-1 显示了这些结构如何结合在一起。正如您所看到的,当块 I/O 请求转换为 bio 结构时,它已被分解为物理内存的各个页面。驱动程序需要做的就是逐一遍历这个结构数组(共有 bi_vcnt 个),并在每个页面内传输数据(但仅限从 offset 开始的 len 个字节)。

Figure 16-1 shows how these structures all tie together. As you can see, by the time a block I/O request is turned into a bio structure, it has been broken down into individual pages of physical memory. All a driver needs to do is to step through this array of structures (there are bi_vcnt of them), and transfer data within each page (but only len bytes starting at offset).
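The traversal just described can be mimicked in user space with a plain array of bio_vec-like records. The struct below is our own stand-in, not the kernel's: walking the array entry by entry and accounting for bv_len bytes per page is exactly the step a driver performs for each segment.

```c
#include <assert.h>

/* A user-space stand-in for struct bio_vec: a "page" is just a buffer. */
struct toy_vec {
    const char  *page;       /* stands in for struct page *bv_page */
    unsigned int len;        /* bytes to transfer from this page */
    unsigned int offset;     /* starting offset within the page */
};

/* Sum the bytes a driver would move by stepping through the vector,
 * the way a driver steps through bi_io_vec (bi_vcnt entries). */
static unsigned long total_bytes(const struct toy_vec *vec, int vcnt)
{
    unsigned long total = 0;
    int i;
    for (i = 0; i < vcnt; i++)
        total += vec[i].len;     /* only len bytes, starting at offset */
    return total;
}
```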

bio 结构

图 16-1。bio 结构

Figure 16-1. The bio structure

不鼓励直接使用 bi_io_vec 数组,这是为了让内核开发人员将来能够在不破坏现有代码的情况下更改 bio 结构。为此,内核提供了一组宏来简化使用 bio 结构的过程。起点是 bio_for_each_segment,它简单地循环遍历 bi_io_vec 数组中每个未处理的条目。该宏应按如下方式使用:

Working directly with the bi_io_vec array is discouraged in the interest of kernel developers being able to change the bio structure in the future without breaking things. To that end, a set of macros has been provided to ease the process of working with the bio structure. The place to start is with bio_for_each_segment, which simply loops through every unprocessed entry in the bi_io_vec array. This macro should be used as follows:

int segno;
struct bio_vec *bvec;

bio_for_each_segment(bvec, bio, segno) {
    /* Do something with this segment */
}
int segno;
struct bio_vec *bvec;

bio_for_each_segment(bvec, bio, segno) {
    /* Do something with this segment */
}

在此循环中,bvec 指向当前的 bio_vec 条目,segno 是当前段号。这些值可用于设置 DMA 传输(第 16.3.5.2 节中描述了使用 blk_rq_map_sg 的替代方法)。如果需要直接访问页面,首先应确保存在正确的内核虚拟地址;为此,您可以使用:

Within this loop, bvec points to the current bio_vec entry, and segno is the current segment number. These values can be used to set up DMA transfers (an alternative way using blk_rq_map_sg is described in Section 16.3.5.2). If you need to access the pages directly, you should first ensure that a proper kernel virtual address exists; to that end, you can use:

char *_ _bio_kmap_atomic(struct bio *bio, int i, enum km_type type);
void _ _bio_kunmap_atomic(char *buffer, enum km_type type);
char *_ _bio_kmap_atomic(struct bio *bio, int i, enum km_type type);
void _ _bio_kunmap_atomic(char *buffer, enum km_type type);

这个低级函数允许您直接映射给定 bio_vec 中找到的缓冲区,由索引 i 指示。这里创建的是原子 kmap;调用者必须提供要使用的适当槽位(如第 15.1.4 节中所述)。

This low-level function allows you to directly map the buffer found in a given bio_vec, as indicated by the index i. An atomic kmap is created; the caller must provide the appropriate slot to use (as described in the section Section 15.1.4).

块层还在 bio 结构内维护一组指针,用来跟踪请求处理的当前状态。存在几个宏来提供对该状态的访问:

The block layer also maintains a set of pointers within the bio structure to keep track of the current state of request processing. Several macros exist to provide access to that state:

struct page *bio_page(struct bio *bio);
struct page *bio_page(struct bio *bio);

返回指向 page 结构的指针,该结构代表接下来要传输的页面。

Returns a pointer to the page structure representing the page to be transferred next.

int bio_offset(struct bio *bio);
int bio_offset(struct bio *bio);

返回要传输的数据在页内的偏移量。

Returns the offset within the page for the data to be transferred.

int bio_cur_sectors(struct bio *bio);
int bio_cur_sectors(struct bio *bio);

返回要从当前页转出的扇区数。

Returns the number of sectors to be transferred out of the current page.

char *bio_data(struct bio *bio);
char *bio_data(struct bio *bio);

返回指向要传输的数据的内核逻辑地址。请注意,仅当相关页面不在高端内存中时,该地址才可用;在其他情况下调用它是一个错误。默认情况下,块子系统不会将高内存缓冲区传递给驱动程序,但如果您使用 blk_queue_bounce_limit 更改了该设置,则可能不应该使用 bio_data。

Returns a kernel logical address pointing to the data to be transferred. Note that this address is available only if the page in question is not located in high memory; calling it in other situations is a bug. By default, the block subsystem does not pass high-memory buffers to your driver, but if you have changed that setting with blk_queue_bounce_limit, you probably should not be using bio_data.

char *bio_kmap_irq(struct bio *bio, unsigned long *flags);

void bio_kunmap_irq(char *buffer, unsigned long *flags);
char *bio_kmap_irq(struct bio *bio, unsigned long *flags);

void bio_kunmap_irq(char *buffer, unsigned long *flags);

bio_kmap_irq 返回任何缓冲区的内核虚拟地址,无论它驻留在高内存还是低内存中。由于使用的是原子 kmap,因此当该映射处于活动状态时,您的驱动程序不能睡眠。使用 bio_kunmap_irq 取消映射缓冲区。请注意,这里的 flags 参数是通过指针传递的。另请注意,由于使用了原子 kmap,因此您一次不能映射多个段。

bio_kmap_irq returns a kernel virtual address for any buffer, regardless of whether it resides in high or low memory. An atomic kmap is used, so your driver cannot sleep while this mapping is active. Use bio_kunmap_irq to unmap the buffer. Note that the flags argument is passed by pointer here. Note also that since an atomic kmap is used, you cannot map more than one segment at a time.

刚才描述的所有函数都访问"当前"缓冲区,即就内核所知尚未传输的第一个缓冲区。驱动程序通常希望在对其中任何一个缓冲区发出完成信号之前(使用稍后描述的 end_that_request_first),先处理 bio 中的多个缓冲区,因此这些函数通常用处不大。还存在其他几个用于处理 bio 结构内部的宏(有关详细信息,请参阅 <linux/bio.h>)。

All of the functions just described access the "current" buffer—the first buffer that, as far as the kernel knows, has not been transferred. Drivers often want to work through several buffers in the bio before signaling completion on any of them (with end_that_request_first, to be described shortly), so these functions are often not useful. Several other macros exist for working with the internals of the bio structure (see <linux/bio.h> for details).

请求结构字段

Request structure fields

现在我们已经了解了结构的工作原理bio,我们可以深入struct request了解请求处理的工作原理。该结构体的字段包括:

Now that we have an idea of how the bio structure works, we can get deep into struct request and see how request processing works. The fields of this structure include:

sector_t hard_sector;

unsigned long hard_nr_sectors;

unsigned int hard_cur_sectors;
sector_t hard_sector;

unsigned long hard_nr_sectors;

unsigned int hard_cur_sectors;

跟踪驱动程序尚未完成的扇区的字段。第一个尚未传输的扇区存储在 hard_sector 中,尚未传输的扇区总数为 hard_nr_sectors,当前 bio 中剩余的扇区数为 hard_cur_sectors。这些字段仅供块子系统内部使用;驱动程序不应使用它们。

Fields that track the sectors that the driver has yet to complete. The first sector that has not been transferred is stored in hard_sector, the total number of sectors yet to transfer is in hard_nr_sectors, and the number of sectors remaining in the current bio is hard_cur_sectors. These fields are intended for use only within the block subsystem; drivers should not make use of them.

struct bio *bio;
struct bio *bio;

bio 是该请求的 bio 结构链表。您不应该直接访问该字段;请使用 rq_for_each_bio(稍后描述)。

bio is the linked list of bio structures for this request. You should not access this field directly; use rq_for_each_bio (described later) instead.

char *buffer;
char *buffer;

本章前面的简单驱动程序示例使用此字段来查找传输缓冲区。有了更深入的理解,我们现在可以看到,这个字段只是对当前 bio 调用 bio_data 的结果。

The simple driver example earlier in this chapter used this field to find the buffer for the transfer. With our deeper understanding, we can now see that this field is simply the result of calling bio_data on the current bio.

unsigned short nr_phys_segments;
unsigned short nr_phys_segments;

合并相邻页后,该请求在物理内存中占用的不同段的数量。

The number of distinct segments occupied by this request in physical memory after adjacent pages have been merged.

struct list_head queuelist;
struct list_head queuelist;

将请求链接到请求队列的链表结构(如第 11.5 节中所述)。如果(且仅当)您使用blkdev_dequeue_request从队列中删除请求,您可以使用此列表头来跟踪驱动程序维护的内部列表中的请求。

The linked-list structure (as described in Section 11.5) that links the request into the request queue. If (and only if) you remove the request from the queue with blkdev_dequeue_request, you may use this list head to track the request in an internal list maintained by your driver.

图 16-2 显示了请求结构及其组成的 bio 结构如何组合在一起。图中,请求已经得到部分满足;cbio 和 buffer 字段指向尚未传输的第一个 bio。

Figure 16-2 shows how the request structure and its component bio structures fit together. In the figure, the request has been partially satisfied; the cbio and buffer fields point to the first bio that has not yet been transferred.

包含部分处理的请求的请求队列

图 16-2。包含部分处理的请求的请求队列

Figure 16-2. A request queue with a partially processed request

request 结构内还有许多其他字段,但本节中的列表对于大多数驱动程序编写者来说应该足够了。

There are many other fields inside the request structure, but the list in this section should be enough for most driver writers.

障碍请求

Barrier requests

在您的驱动程序看到请求之前,块层会对它们重新排序,以提高 I/O 性能。如果有理由,您的驱动程序也可以对请求重新排序。通常,这种重新排序是通过将多个请求传递给驱动器并让硬件找出最佳排序来实现的。然而,不受限制地重新排序请求存在一个问题:某些应用程序需要保证某些操作在其他操作开始之前完成。例如,关系数据库管理器必须绝对确保其日志信息已刷新到驱动器,然后才能对数据库内容执行事务。目前在大多数 Linux 系统上使用的日志文件系统也有非常相似的排序约束。如果错误的操作被重新排序,结果可能是严重的、难以察觉的数据损坏。

The block layer reorders requests before your driver sees them to improve I/O performance. Your driver, too, can reorder requests if there is a reason to do so. Often, this reordering happens by passing multiple requests to the drive and letting the hardware figure out the optimal ordering. There is a problem with unrestricted reordering of requests, however: some applications require guarantees that certain operations will complete before others are started. Relational database managers, for example, must be absolutely sure that their journaling information has been flushed to the drive before executing a transaction on the database contents. Journaling filesystems, which are now in use on most Linux systems, have very similar ordering constraints. If the wrong operations are reordered, the result can be severe, undetected data corruption.

2.6 块层通过屏障请求的概念解决了这个问题。如果请求标记了 REQ_HARDBARRIER 标志,则必须在启动任何后续请求之前将其写入驱动器。所谓"写入驱动器",是指数据必须实际驻留在物理介质上并持久存在。许多驱动器会对写入请求进行缓存;这种缓存可以提高性能,但可能会破坏屏障请求的目的。如果在关键数据仍位于驱动器缓存中时发生电源故障,即使驱动器已报告完成,该数据仍会丢失。因此,实现屏障请求的驱动程序必须采取措施,强制驱动器将数据实际写入介质。

The 2.6 block layer addresses this problem with the concept of a barrier request. If a request is marked with the REQ_HARDBARRIER flag, it must be written to the drive before any following request is initiated. By "written to the drive," we mean that the data must actually reside and be persistent on the physical media. Many drives perform caching of write requests; this caching improves performance, but it can defeat the purpose of barrier requests. If a power failure occurs when the critical data is still sitting in the drive's cache, that data is still lost even if the drive has reported completion. So a driver that implements barrier requests must take steps to force the drive to actually write the data to the media.

如果您的驱动程序支持屏障请求,第一步是将此事实告知块层。屏障处理是请求队列的另一项设置;它通过以下函数设置:

If your driver honors barrier requests, the first step is to inform the block layer of this fact. Barrier handling is another of the request queue settings; it is set with:

void blk_queue_ordered(request_queue_t *queue, int flag);
void blk_queue_ordered(request_queue_t *queue, int flag);

要指示您的驱动程序实现屏障请求,请将flag参数设置为非零值。

To indicate that your driver implements barrier requests, set the flag parameter to a nonzero value.

屏障请求的实际实现,只是测试 request 结构中相关标志的问题。内核提供了一个宏来执行此测试:

The actual implementation of barrier requests is simply a matter of testing for the associated flag in the request structure. A macro has been provided to perform this test:

int blk_barrier_rq(struct request *req);
int blk_barrier_rq(struct request *req);

如果该宏返回非零值,则该请求是屏障请求。根据您的硬件工作方式,您可能必须停止从队列中获取请求,直到屏障请求完成。其他驱动器可以自己理解屏障请求;在这种情况下,您的驱动程序所要做的就是为这些驱动器发出正确的操作。

If this macro returns a nonzero value, the request is a barrier request. Depending on how your hardware works, you may have to stop taking requests from the queue until the barrier request has been completed. Other drives can understand barrier requests themselves; in this case, all your driver has to do is to issue the proper operations for those drives.
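One way to picture the constraint is that a scheduler (or driver) may sort freely between barriers, but nothing may cross one and the barrier itself stays in place. The sketch below is our own user-space model, with a simple flag standing in for the kernel's barrier marking; it sorts each barrier-delimited run of requests independently.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_breq {
    unsigned long sector;
    int barrier;             /* stands in for the kernel's barrier flag */
};

static int cmp_breq(const void *a, const void *b)
{
    const struct toy_breq *ra = a, *rb = b;
    return (ra->sector > rb->sector) - (ra->sector < rb->sector);
}

/* Sort each run of requests delimited by barriers; a barrier request
 * stays in place, and no request moves across it. */
static void sort_with_barriers(struct toy_breq *reqs, size_t n)
{
    size_t start = 0, i;
    for (i = 0; i <= n; i++) {
        if (i == n || reqs[i].barrier) {
            qsort(reqs + start, i - start, sizeof(*reqs), cmp_breq);
            start = i + 1;   /* the barrier itself is never reordered */
        }
    }
}
```

Requests before the barrier end up sorted among themselves, as do those after it, but none trade places across the barrier.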

不可重试的请求

Nonretryable requests

块驱动程序经常尝试重试第一次失败的请求。这种行为可以使系统更加可靠,并有助于避免数据丢失。然而,内核有时会将请求标记为不可重试。如果此类请求无法在第一次尝试时执行,则应尽快直接失败。

Block drivers often attempt to retry requests that fail the first time. This behavior can lead to a more reliable system and help to avoid data loss. The kernel, however, sometimes marks requests as not being retryable. Such requests should simply fail as quickly as possible if they cannot be executed on the first try.

如果您的驱动程序正在考虑重试失败的请求,则应首先调用:

If your driver is considering retrying a failed request, it should first make a call to:

int blk_noretry_request(struct request *req);
int blk_noretry_request(struct request *req);

如果此宏返回非零值,则您的驱动程序应该简单地使用错误代码中止请求,而不是重试。

If this macro returns a nonzero value, your driver should simply abort the request with an error code instead of retrying it.

请求完成函数

Request Completion Functions

正如我们将看到的,处理 request 结构有几种不同的方式。然而,它们都使用几个通用函数来处理 I/O 请求或请求一部分的完成。这两个函数都是原子的,可以安全地从原子上下文中调用。

There are, as we will see, several different ways of working through a request structure. All of them make use of a couple of common functions, however, which handle the completion of an I/O request or parts of a request. Both of these functions are atomic and can be safely called from an atomic context.

当您的设备完成 I/O 请求中的部分或全部扇区传输时,它必须通知块子系统:

When your device has completed transferring some or all of the sectors in an I/O request, it must inform the block subsystem with:

int end_that_request_first(struct request *req, int success, int count);
int end_that_request_first(struct request *req, int success, int count);

此函数告诉块代码,您的驱动程序已从上次停止的地方开始完成了 count 个扇区的传输。如果 I/O 成功,则将 success 传递为 1;否则传递 0。请注意,您必须按从第一个扇区到最后一个扇区的顺序发出完成信号;如果您的驱动程序和设备以某种方式无序地完成了请求,则必须先保存乱序的完成状态,直到中间的扇区都已传输完毕。

This function tells the block code that your driver has finished with the transfer of count sectors starting where you last left off. If the I/O was successful, pass success as 1; otherwise pass 0. Note that you must signal completion in order from the first sector to the last; if your driver and device somehow conspire to complete requests out of order, you have to store the out-of-order completion status until the intervening sectors have been transferred.
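The in-order accounting can be sketched as follows. This is our own toy structure; the real bookkeeping lives inside end_that_request_first, but the return convention is the same: each call retires count sectors from the front of the request, and a zero return means nothing remains.

```c
#include <assert.h>

struct toy_creq {
    unsigned long remaining;   /* sectors not yet completed */
};

/* Analogue of end_that_request_first's return convention: retire count
 * sectors; return nonzero while sectors remain, 0 when the request is
 * fully transferred. */
static int toy_end_first(struct toy_creq *req, unsigned long count)
{
    if (count >= req->remaining)
        req->remaining = 0;
    else
        req->remaining -= count;
    return req->remaining != 0;
}
```

In a driver, a zero return is the cue to dequeue the request and finish it; partial completions simply advance the front of the request.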

end_that_request_first的返回值指示该请求中的所有扇区是否已被传输。返回值 0表示所有扇区均已传输并且请求已完成。此时,您必须使用 blkdev_dequeue_request使请求出队(如果您尚未这样做)并将其传递给:

The return value from end_that_request_first is an indication of whether all sectors in this request have been transferred or not. A return value of 0 means that all sectors have been transferred and that the request is complete. At that point, you must dequeue the request with blkdev_dequeue_request (if you have not already done so) and pass it to:

void end_that_request_last(struct request *req);
void end_that_request_last(struct request *req);

end_that_request_last 通知等待该请求的一方请求已完成,并回收 request 结构;必须在持有队列锁的情况下调用它。

end_that_request_last informs whoever is waiting for the request that it has completed and recycles the request structure; it must be called with the queue lock held.

在我们简单的 sbull 示例中,我们没有使用上述任何函数。那个示例调用的是 end_request。为了展示此调用的效果,下面是 2.6.10 内核中完整的 end_request 函数:

In our simple sbull example, we didn't use any of the above functions. That example, instead, called end_request. To show the effects of this call, here is the entire end_request function as seen in the 2.6.10 kernel:

void end_request(struct request *req, int uptodate)
{
    if (!end_that_request_first(req, uptodate, req->hard_cur_sectors)) {
        add_disk_randomness(req->rq_disk);
        blkdev_dequeue_request(req);
        end_that_request_last(req);
    }
}

The function add_disk_randomness uses the timing of block I/O requests to contribute entropy to the system's random number pool; it should be called only if the disk's timing is truly random. That is true for most mechanical devices, but it is not true for a memory-based virtual device, such as sbull. For this reason, the more complicated version of sbull shown in the next section does not call add_disk_randomness.

Working with bios

You now know enough to write a block driver that works directly with the bio structures that make up a request. An example might help, however. If the sbull driver is loaded with the request_mode parameter set to 1, it registers a bio-aware request function instead of the simple function we saw above. That function looks like this:

static void sbull_full_request(request_queue_t *q)
{
    struct request *req;
    int sectors_xferred;
    struct sbull_dev *dev = q->queuedata;

    while ((req = elv_next_request(q)) != NULL) {
        if (! blk_fs_request(req)) {
            printk (KERN_NOTICE "Skip non-fs request\n");
            end_request(req, 0);
            continue;
        }
        sectors_xferred = sbull_xfer_request(dev, req);
        if (! end_that_request_first(req, 1, sectors_xferred)) {
            blkdev_dequeue_request(req);
            end_that_request_last(req);
        }
    }
}

This function simply takes each request, passes it to sbull_xfer_request, then completes it with end_that_request_first and, if necessary, end_that_request_last. Thus, this function is handling the high-level queue and request management parts of the problem. The job of actually executing a request, however, falls to sbull_xfer_request:

static int sbull_xfer_request(struct sbull_dev *dev, struct request *req)
{
    struct bio *bio;
    int nsect = 0;
    
    rq_for_each_bio(bio, req) {
        sbull_xfer_bio(dev, bio);
        nsect += bio->bi_size/KERNEL_SECTOR_SIZE;
    }
    return nsect;
}

Here we introduce another macro: rq_for_each_bio. As you might expect, this macro simply steps through each bio structure in the request, giving us a pointer that we can pass to sbull_xfer_bio for the transfer. That function looks like:

static int sbull_xfer_bio(struct sbull_dev *dev, struct bio *bio)
{
    int i;
    struct bio_vec *bvec;
    sector_t sector = bio->bi_sector;

    /* Do each segment independently. */
    bio_for_each_segment(bvec, bio, i) {
        char *buffer = __bio_kmap_atomic(bio, i, KM_USER0);
        sbull_transfer(dev, sector, bio_cur_sectors(bio),
                buffer, bio_data_dir(bio) == WRITE);
        sector += bio_cur_sectors(bio);
        __bio_kunmap_atomic(buffer, KM_USER0);
    }
    return 0; /* Always "succeed" */
}

This function simply steps through each segment in the bio structure, gets a kernel virtual address to access the buffer, then calls the same sbull_transfer function we saw earlier to copy the data over.

Each device has its own needs, but, as a general rule, the code just shown should serve as a model for many situations where digging through the bio structures is needed.

Block requests and DMA

If you are working on a high-performance block driver, chances are you will be using DMA for the actual data transfers. A block driver can certainly step through the bio structures, as described above, create a DMA mapping for each one, and pass the result to the device. There is an easier way, however, if your device can do scatter/gather I/O. The function:

int blk_rq_map_sg(request_queue_t *queue, struct request *req, 
                  struct scatterlist *list);

fills in the given list with the full set of segments from the given request. Segments that are adjacent in memory are coalesced prior to insertion into the scatterlist, so you need not try to detect them yourself. The return value is the number of entries in the list. The function also passes back, in its third argument, a scatterlist suitable for passing to dma_map_sg. (See Section 15.4.4.7 for more information on dma_map_sg.)

Your driver must allocate the storage for the scatterlist before calling blk_rq_map_sg. The list must be able to hold at least as many entries as the request has physical segments; the struct request field nr_phys_segments holds that count, which will not exceed the maximum number of physical segments specified with blk_queue_max_phys_segments.
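
As a sketch (not taken from the book's sample code), a driver's transfer-start routine might combine blk_rq_map_sg with the dma_map_sg function from Chapter 15. The foo_* names and device fields here are hypothetical:

```c
/* Hypothetical sketch: map a request for scatter/gather DMA.
 * foo_dev, its fields, and foo_start_dma are invented for
 * illustration; dev->sglist must have been allocated with room
 * for req->nr_phys_segments entries. */
static void foo_start_transfer(struct foo_dev *dev, struct request *req)
{
    int nseg;

    /* Fill the scatterlist; adjacent segments are coalesced. */
    nseg = blk_rq_map_sg(dev->queue, req, dev->sglist);

    /* Create the actual DMA mappings (see Chapter 15). */
    nseg = dma_map_sg(dev->dma_dev, dev->sglist, nseg,
                      rq_data_dir(req) == WRITE ?
                      DMA_TO_DEVICE : DMA_FROM_DEVICE);

    foo_start_dma(dev, dev->sglist, nseg);  /* invented: program the hardware */
}
```
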

If you do not want blk_rq_map_sg to coalesce adjacent segments, you can change the default behavior with a call such as:

clear_bit(QUEUE_FLAG_CLUSTER, &queue->queue_flags);

Some SCSI disk drivers mark their request queue in this way, since they do not benefit from the coalescing of requests.

Doing without a request queue

Previously, we have discussed the work the kernel does to optimize the order of requests in the queue; this work involves sorting requests and, perhaps, even stalling the queue to allow an anticipated request to arrive. These techniques help the system's performance when dealing with a real, spinning disk drive. They are completely wasted, however, with a device like sbull. Many block-oriented devices, such as flash memory arrays, readers for media cards used in digital cameras, and RAM disks have truly random-access performance and do not benefit from advanced-request queueing logic. Other devices, such as software RAID arrays or virtual disks created by logical volume managers, do not have the performance characteristics for which the block layer's request queues are optimized. For this kind of device, it would be better to accept requests directly from the block layer and not bother with the request queue at all.

For these situations, the block layer supports a "no queue" mode of operation. To make use of this mode, your driver must provide a "make request" function, rather than a request function. The make_request function has this prototype:

typedef int (make_request_fn) (request_queue_t *q, struct bio *bio);

Note that a request queue is still present, even though it will never actually hold any requests. The make_request function takes as its main parameter a bio structure, which represents one or more buffers to be transferred. The make_request function can do one of two things: it can either perform the transfer directly, or it can redirect the request to another device.

Performing the transfer directly is just a matter of working through the bio with the accessor methods we described earlier. Since there is no request structure to work with, however, your function should signal completion directly to the creator of the bio structure with a call to bio_endio:

void bio_endio(struct bio *bio, unsigned int bytes, int error);

Here, bytes is the number of bytes you have transferred so far. It can be less than the number of bytes represented by the bio as a whole; in this way, you can signal partial completion, and update the internal "current buffer" pointers within the bio. You should either call bio_endio again as your device makes further progress, or signal an error if you are unable to complete the request. Errors are indicated by providing a nonzero value for the error parameter; this value is normally an error code such as -EIO. The make_request function should return 0, regardless of whether the I/O is successful.

If sbull is loaded with request_mode=2, it operates with a make_request function. Since sbull already has a function that can transfer a single bio, the make_request function is simple:

static int sbull_make_request(request_queue_t *q, struct bio *bio)
{
    struct sbull_dev *dev = q->queuedata;
    int status;

    status = sbull_xfer_bio(dev, bio);
    bio_endio(bio, bio->bi_size, status);
    return 0;
}

Please note that you should never call bio_endio from a regular request function; that job is handled by end_that_request_first instead.

Some block drivers, such as those implementing volume managers and software RAID arrays, really need to redirect the request to another device that handles the actual I/O. Writing such a driver is beyond the scope of this book. We note, however, that if the make_request function returns a nonzero value, the bio is submitted again. A "stacking" driver can, therefore, modify the bi_bdev field to point to a different device, change the starting sector value, then return; the block system then passes the bio to the new device. There is also a bio_split call that can be used to split a bio into multiple chunks for submission to more than one device, although, if the queue parameters are set up correctly, splitting a bio in this way should almost never be necessary.
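
To make the redirection idea concrete, a minimal remapping make_request function might look like the following sketch. The foo_dev structure and its fields are hypothetical, and a real stacking driver has far more to worry about:

```c
/* Hypothetical sketch of a "stacking" make_request function. */
static int foo_remap_request(request_queue_t *q, struct bio *bio)
{
    struct foo_dev *dev = q->queuedata;

    bio->bi_bdev = dev->target_bdev;      /* point at the real device */
    bio->bi_sector += dev->start_sector;  /* shift into its sector space */
    return 1;  /* nonzero: the block layer resubmits the bio */
}
```
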

Either way, you must tell the block subsystem that your driver is using a custom make_request function. To do so, you must allocate a request queue with:

request_queue_t *blk_alloc_queue(int flags);

This function differs from blk_init_queue in that it does not actually set up the queue to hold requests. The flags argument is a set of allocation flags to be used in allocating memory for the queue; usually the right value is GFP_KERNEL. Once you have a queue, pass it and your make_request function to blk_queue_make_request:

void blk_queue_make_request(request_queue_t *queue, make_request_fn *func);

The sbull code to set up the make_request function looks like:

dev->queue = blk_alloc_queue(GFP_KERNEL);
if (dev->queue == NULL)
    goto out_vfree;
blk_queue_make_request(dev->queue, sbull_make_request);

For the curious, some time spent digging through drivers/block/ll_rw_blk.c shows that all queues have a make_request function. The default version, generic_make_request, handles the incorporation of the bio into a request structure. By providing a make_request function of its own, a driver is really just overriding a specific request queue method and shorting out much of the work.

Some Other Details

This section covers a few other aspects of the block layer that may be of interest for advanced drivers. None of the following facilities need to be used to write a correct driver, but they may be helpful in some situations.

Command Pre-Preparation

The block layer provides a mechanism for drivers to examine and preprocess requests before they are returned from elv_next_request. This mechanism allows drivers to set up the actual drive commands ahead of time, decide whether the request can be handled at all, or perform other sorts of housekeeping.

If you want to use this feature, create a command preparation function that fits this prototype:

typedef int (prep_rq_fn) (request_queue_t *queue, struct request *req);

The request structure includes a field called cmd, which is an array of BLK_MAX_CDB bytes; this array may be used by the preparation function to store the actual hardware command (or any other useful information). This function should return one of the following values:

BLKPREP_OK

Command preparation went normally, and the request can be handed to your driver's request function.

BLKPREP_KILL

This request cannot be completed; it is failed with an error code.

BLKPREP_DEFER

This request cannot be completed at this time. It stays at the front of the queue but is not handed to the request function.

The preparation function is called by elv_next_request immediately before the request is returned to your driver. If this function returns BLKPREP_DEFER, the return value from elv_next_request to your driver is NULL. This mode of operation can be useful if, for example, your device has reached the maximum number of requests it can have outstanding.

To have the block layer call your preparation function, pass it to:

void blk_queue_prep_rq(request_queue_t *queue, prep_rq_fn *func);

By default, request queues have no preparation function.
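
By way of illustration (this is not part of sbull), a preparation function for a hypothetical device might defer requests when the hardware is saturated and encode the command otherwise; the foo_* names are invented:

```c
/* Hypothetical sketch of a command preparation function. */
static int foo_prep_fn(request_queue_t *queue, struct request *req)
{
    struct foo_dev *dev = queue->queuedata;

    if (!blk_fs_request(req))
        return BLKPREP_KILL;        /* fail requests we cannot handle */
    if (dev->inflight >= FOO_MAX_INFLIGHT)
        return BLKPREP_DEFER;       /* leave it at the head of the queue */
    foo_encode_cmd(req->cmd, req);  /* invented: fill in the hardware command */
    return BLKPREP_OK;
}
```

Such a function would be registered at initialization time with blk_queue_prep_rq(queue, foo_prep_fn).
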

Tagged Command Queueing

Hardware that can have multiple requests active at once usually supports some form of tagged command queueing (TCQ). TCQ is simply the technique of attaching an integer "tag" to each request so that when the drive completes one of those requests, it can tell the driver which one. In previous versions of the kernel, block drivers that implemented TCQ had to do all of the work themselves; in 2.6, a TCQ support infrastructure has been added to the block layer for all drivers to use.

If your drive performs tagged command queueing, you should inform the kernel of that fact at initialization time with a call to:

int blk_queue_init_tags(request_queue_t *queue, int depth, 
                        struct blk_queue_tag *tags);

Here, queue is your request queue, and depth is the number of tagged requests your device can have outstanding at any given time. tags is an optional pointer to an array of struct blk_queue_tag structures; there must be depth of them. Normally, tags can be passed as NULL, and blk_queue_init_tags allocates the array. If, however, you need to share the same tags between multiple devices, you can pass the tags array pointer (stored in the queue_tags field) from another request queue. You should never actually allocate the tags array yourself; the block layer needs to initialize the array and does not export the initialization function to modules.

Since blk_queue_init_tags allocates memory, it can fail; it returns a negative error code to the caller in that case.

If the number of tags your device can handle changes, you can inform the kernel with:

int blk_queue_resize_tags(request_queue_t *queue, int new_depth);

The queue lock must be held during the call. This call can fail, returning a negative error code in that case.

The association of a tag with a request structure is done with blk_queue_start_tag, which must be called with the queue lock held:

int blk_queue_start_tag(request_queue_t *queue, struct request *req);

If a tag is available, this function allocates it for this request, stores the tag number in req->tag, and returns 0. It also dequeues the request from the queue and links it into its own tag-tracking structure, so your driver should take care not to dequeue the request itself if it's using tags. If no more tags are available, blk_queue_start_tag leaves the request on the queue and returns a nonzero value.
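
A request function for a tag-aware driver might therefore look like this sketch (the foo_* names are invented); note that it does not call blkdev_dequeue_request, since blk_queue_start_tag dequeues the request itself:

```c
/* Hypothetical sketch: issuing tagged requests.  The request
 * function is called with the queue lock held. */
static void foo_request(request_queue_t *q)
{
    struct request *req;

    while ((req = elv_next_request(q)) != NULL) {
        if (blk_queue_start_tag(q, req))
            return;  /* no tag available; stop until one is freed */
        /* req->tag now holds the tag; req has been dequeued for us */
        foo_issue(q->queuedata, req);  /* invented: start the hardware */
    }
}
```
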

When all transfers for a given request have been completed, your driver should return the tag with:

void blk_queue_end_tag(request_queue_t *queue, struct request *req);

Once again, you must hold the queue lock before calling this function. The call should be made after end_that_request_first returns 0 (meaning that the request is complete) but before calling end_that_request_last. Remember that the request is already dequeued, so it would be a mistake for your driver to do so at this point.

If you need to find the request associated with a given tag (when the drive reports completion, for example), use blk_queue_find_tag:

struct request *blk_queue_find_tag(request_queue_t *queue, int tag);

The return value is the associated request structure, unless something has gone truly wrong.
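
Putting the pieces together, a completion path driven by a tag reported from the hardware might be sketched as follows (hypothetical foo_ naming; assume the caller already holds the queue lock):

```c
/* Hypothetical sketch: completing a tagged request. */
static void foo_complete_tag(request_queue_t *q, int tag)
{
    struct request *req = blk_queue_find_tag(q, tag);

    if (req == NULL)
        return;  /* tag was not outstanding; something has gone wrong */
    if (!end_that_request_first(req, 1, req->hard_nr_sectors)) {
        blk_queue_end_tag(q, req);  /* return the tag; req is already dequeued */
        end_that_request_last(req);
    }
}
```
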

If things really do go wrong, your driver may find itself having to reset or perform some other act of violence against one of its devices. In that case, any outstanding tagged commands will not be completed. The block layer provides a function that can help with the recovery effort in such situations:

void blk_queue_invalidate_tags(request_queue_t *queue);

This function returns all outstanding tags to the pool and puts the associated requests back into the request queue. The queue lock must be held when you call this function.

Quick Reference

#include <linux/fs.h>

int register_blkdev(unsigned int major, const char *name);

int unregister_blkdev(unsigned int major, const char *name);

register_blkdev registers a block driver with the kernel and, optionally, obtains a major number. A driver can be unregistered with unregister_blkdev.

struct block_device_operations

Structure that holds most of the methods for block drivers.

#include <linux/genhd.h>

struct gendisk;

Structure that describes a single block device within the kernel.

struct gendisk *alloc_disk(int minors);

void add_disk(struct gendisk *gd);

Functions that allocate gendisk structures and return them to the system.

void set_capacity(struct gendisk *gd, sector_t sectors);

Stores the capacity of the device (in 512-byte sectors) within the gendisk structure.

void add_disk(struct gendisk *gd);

Adds a disk to the kernel. As soon as this function is called, your disk's methods can be invoked by the kernel.

int check_disk_change(struct block_device *bdev);

A kernel function that checks for a media change in the given disk drive and takes the required cleanup action when such a change is detected.

#include <linux/blkdev.h>

request_queue_t *blk_init_queue(request_fn_proc *request, spinlock_t *lock);

void blk_cleanup_queue(request_queue_t *);

Functions that handle the creation and deletion of block request queues.

struct request *elv_next_request(request_queue_t *queue);

void end_request(struct request *req, int success);

elv_next_request obtains the next request from a request queue; end_request may be used in very simple drivers to mark the completion of (or part of) a request.

void blkdev_dequeue_request(struct request *req);

void elv_requeue_request(request_queue_t *queue, struct request *req);

Functions that remove a request from a queue and put it back on if necessary.

void blk_stop_queue(request_queue_t *queue);

void blk_start_queue(request_queue_t *queue);

If you need to prevent further calls to your request method, a call to blk_stop_queue does the trick. A call to blk_start_queue is necessary to cause your request method to be invoked again.

void blk_queue_bounce_limit(request_queue_t *queue, u64 dma_addr);

void blk_queue_max_sectors(request_queue_t *queue, unsigned short max);

void blk_queue_max_phys_segments(request_queue_t *queue, unsigned short max);

void blk_queue_max_hw_segments(request_queue_t *queue, unsigned short max);

void blk_queue_max_segment_size(request_queue_t *queue, unsigned int max);

void blk_queue_segment_boundary(request_queue_t *queue, unsigned long mask);

void blk_queue_dma_alignment(request_queue_t *queue, int mask);

void blk_queue_hardsect_size(request_queue_t *queue, unsigned short max);

Functions that set various queue parameters that control how requests are created for a particular device; the parameters are described in Section 16.3.3.3.

#include <linux/bio.h>

struct bio;

Low-level structure representing a portion of a block I/O request.

bio_sectors(struct bio *bio);

bio_data_dir(struct bio *bio);

Two macros that yield the size and direction of a transfer described by a bio structure.

bio_for_each_segment(bvec, bio, segno);

A pseudocontrol structure used to loop through the segments that make up a bio structure.

char *__bio_kmap_atomic(struct bio *bio, int i, enum km_type type);

void __bio_kunmap_atomic(char *buffer, enum km_type type);

__bio_kmap_atomic may be used to create a kernel virtual address for a given segment within a bio structure. The mapping must be undone with __bio_kunmap_atomic.

struct page *bio_page(struct bio *bio);

int bio_offset(struct bio *bio);

int bio_cur_sectors(struct bio *bio);

char *bio_data(struct bio *bio);

char *bio_kmap_irq(struct bio *bio, unsigned long *flags);

void bio_kunmap_irq(char *buffer, unsigned long *flags);

A set of accessor macros that provide access to the "current" segment within a bio structure.

void blk_queue_ordered(request_queue_t *queue, int flag);

int blk_barrier_rq(struct request *req);

Call blk_queue_ordered if your driver implements barrier requests—as it should. The macro blk_barrier_rq returns a nonzero value if the current request is a barrier request.

int blk_noretry_request(struct request *req);

This macro returns a nonzero value if the given request should not be retried on errors.

int end_that_request_first(struct request *req, int success, int count);

void end_that_request_last(struct request *req);

Use end_that_request_first to indicate completion of a portion of a block I/O request. When that function returns 0, the request is complete and should be passed to end_that_request_last.

rq_for_each_bio(bio, request)

Another macro-implemented control structure; it steps through each bio that makes up a request.

int blk_rq_map_sg(request_queue_t *queue, struct request *req,
                  struct scatterlist *list);

Fills the given scatterlist with the information needed to map the buffers in the given request for a DMA transfer.

typedef int (make_request_fn) (request_queue_t *q, struct bio *bio);

The prototype for the make_request function.

void bio_endio(struct bio *bio, unsigned int bytes, int error);

Signal completion for a given bio. This function should be used only if your driver obtained the bio directly from the block layer via the make_request function.

request_queue_t *blk_alloc_queue(int flags);
void blk_queue_make_request(request_queue_t *queue, make_request_fn *func);

Use blk_alloc_queue to allocate a request queue that is used with a custom make_request function. That function should be set with blk_queue_make_request.

typedef int (prep_rq_fn) (request_queue_t *queue, struct request *req);
void blk_queue_prep_rq(request_queue_t *queue, prep_rq_fn *func);

The prototype and setup functions for a command preparation function, which can be used to prepare the necessary hardware command before the request is passed to your request function.

int blk_queue_init_tags(request_queue_t *queue, int depth,
                        struct blk_queue_tag *tags);
int blk_queue_resize_tags(request_queue_t *queue, int new_depth);
int blk_queue_start_tag(request_queue_t *queue, struct request *req);
void blk_queue_end_tag(request_queue_t *queue, struct request *req);
struct request *blk_queue_find_tag(request_queue_t *queue, int tag);
void blk_queue_invalidate_tags(request_queue_t *queue);

Support functions for drivers using tagged command queueing.

Chapter 17. Network Drivers

Having discussed char and block drivers, we are now ready to move on to the world of networking. Network interfaces are the third standard class of Linux devices, and this chapter describes how they interact with the rest of the kernel.

The role of a network interface within the system is similar to that of a mounted block device. A block device registers its disks and methods with the kernel, and then "transmits" and "receives" blocks on request, by means of its request function. Similarly, a network interface must register itself within specific kernel data structures in order to be invoked when packets are exchanged with the outside world.

There are a few important differences between mounted disks and packet-delivery interfaces. To begin with, a disk exists as a special file in the /dev directory, whereas a network interface has no such entry point. The normal file operations (read, write, and so on) do not make sense when applied to network interfaces, so it is not possible to apply the Unix "everything is a file" approach to them. Thus, network interfaces exist in their own namespace and export a different set of operations.

Although you may object that applications use the read and write system calls when using sockets, those calls act on a software object that is distinct from the interface. Several hundred sockets can be multiplexed on the same physical interface.

But the most important difference between the two is that block drivers operate only in response to requests from the kernel, whereas network drivers receive packets asynchronously from the outside. Thus, while a block driver is asked to send a buffer toward the kernel, the network device asks to push incoming packets toward the kernel. The kernel interface for network drivers is designed for this different mode of operation.

Network drivers also have to be prepared to support a number of administrative tasks, such as setting addresses, modifying transmission parameters, and maintaining traffic and error statistics. The API for network drivers reflects this need and, therefore, looks somewhat different from the interfaces we have seen so far.

The network subsystem of the Linux kernel is designed to be completely protocol-independent. This applies to both networking protocols ( Internet protocol [IP] versus IPX or other protocols) and hardware protocols (Ethernet versus token ring, etc.). Interaction between a network driver and the kernel properly deals with one network packet at a time; this allows protocol issues to be hidden neatly from the driver and the physical transmission to be hidden from the protocol.

This chapter describes how the network interfaces fit in with the rest of the Linux kernel and provides examples in the form of a memory-based modularized network interface, which is called (you guessed it) snull. To simplify the discussion, the interface uses the Ethernet hardware protocol and transmits IP packets. The knowledge you acquire from examining snull can be readily applied to protocols other than IP, and writing a non-Ethernet driver is different only in tiny details related to the actual network protocol.

This chapter doesn't talk about IP numbering schemes, network protocols, or other general networking concepts. Such topics are not (usually) of concern to the driver writer, and it's impossible to offer a satisfactory overview of networking technology in less than a few hundred pages. The interested reader is urged to refer to other books describing networking issues.

One note on terminology is called for before getting into network devices. The networking world uses the term octet to refer to a group of eight bits, which is generally the smallest unit understood by networking devices and protocols. The term byte is almost never encountered in this context. In keeping with standard usage, we will use octet when talking about networking devices.

The term "header" also merits a quick mention. A header is a set of bytes (err, octets) prepended to a packet as it is passed through the various layers of the networking subsystem. When an application sends a block of data through a TCP socket, the networking subsystem breaks that data up into packets and puts a TCP header, describing where each packet fits within the stream, at the beginning. The lower levels then put an IP header, used to route the packet to its destination, in front of the TCP header. If the packet moves over an Ethernet-like medium, an Ethernet header, interpreted by the hardware, goes in front of the rest. Network drivers need not concern themselves with higher-level headers (usually), but they often must be involved in the creation of the hardware-level header.

How snull Is Designed

This section discusses the design concepts that led to the snull network interface. Although this information might appear to be of marginal use, failing to understand it might lead to problems when you play with the sample code.

The first, and most important, design decision was that the sample interfaces should remain independent of real hardware, just like most of the sample code used in this book. This constraint led to something that resembles the loopback interface. snull is not a loopback interface; however, it simulates conversations with real remote hosts in order to better demonstrate the task of writing a network driver. The Linux loopback driver is actually quite simple; it can be found in drivers/net/loopback.c.

Another feature of snull is that it supports only IP traffic. This is a consequence of the internal workings of the interface—snull has to look inside and interpret the packets to properly emulate a pair of hardware interfaces. Real interfaces don't depend on the protocol being transmitted, and this limitation of snull doesn't affect the fragments of code shown in this chapter.

Assigning IP Numbers

The snull module creates two interfaces. These interfaces are different from a simple loopback, in that whatever you transmit through one of the interfaces loops back to the other one, not to itself. It looks like you have two external links, but actually your computer is replying to itself.

Unfortunately, this effect can't be accomplished through IP number assignments alone, because the kernel wouldn't send out a packet through interface A that was directed to its own interface B. Instead, it would use the loopback channel without passing through snull. To be able to establish a communication through the snull interfaces, the source and destination addresses need to be modified during data transmission. In other words, packets sent through one of the interfaces should be received by the other, but the receiver of the outgoing packet shouldn't be recognized as the local host. The same applies to the source address of received packets.

To achieve this kind of "hidden loopback," the snull interface toggles the least significant bit of the third octet of both the source and destination addresses; that is, it changes both the network number and the host number of class C IP numbers. The net effect is that packets sent to network A (connected to sn0, the first interface) appear on the sn1 interface as packets belonging to network B.

To avoid dealing with too many numbers, let's assign symbolic names to the IP numbers involved:

  • snullnet0 is the network that is connected to the sn0 interface. Similarly, snullnet1 is the network connected to sn1. The addresses of these networks should differ only in the least significant bit of the third octet. These networks must have 24-bit netmasks.

  • local0 is the IP address assigned to the sn0 interface; it belongs to snullnet0. The address associated with sn1 is local1. local0 and local1 must differ in the least significant bit of their third octet and in the fourth octet.

  • remote0 is a host in snullnet0, and its fourth octet is the same as that of local1. Any packet sent to remote0 reaches local1 after its network address has been modified by the interface code. The host remote1 belongs to snullnet1, and its fourth octet is the same as that of local0.

The operation of the snull interfaces is depicted in Figure 17-1, in which the hostname associated with each interface is printed near the interface name.

Figure 17-1. How a host sees its interfaces

Here are possible values for the network numbers. Once you put these lines in /etc/networks, you can call your networks by name. The values were chosen from the range of numbers reserved for private use.

snullnet0       192.168.0.0
snullnet1       192.168.1.0

The following are possible host numbers to put into /etc/hosts:

192.168.0.1   local0
192.168.0.2   remote0
192.168.1.2   local1
192.168.1.1   remote1

The important feature of these numbers is that the host portion of local0 is the same as that of remote1, and the host portion of local1 is the same as that of remote0. You can use completely different numbers as long as this relationship applies.

Be careful, however, if your computer is already connected to a network. The numbers you choose might be real Internet or intranet numbers, and assigning them to your interfaces prevents communication with the real hosts. For example, although the numbers just shown are not routable Internet numbers, they could already be used by your private network.

Whatever numbers you choose, you can correctly set up the interfaces for operation by issuing the following commands:

ifconfig sn0 local0
ifconfig sn1 local1

You may need to add the netmask 255.255.255.0 parameter if the address range chosen is not a class C range.

At this point, the "remote" end of the interface can be reached. The following screendump shows how a host reaches remote0 and remote1 through the snull interface:

morgana% ping -c 2 remote0
64 bytes from 192.168.0.99: icmp_seq=0 ttl=64 time=1.6 ms
64 bytes from 192.168.0.99: icmp_seq=1 ttl=64 time=0.9 ms
2 packets transmitted, 2 packets received, 0% packet loss

morgana% ping -c 2 remote1
64 bytes from 192.168.1.88: icmp_seq=0 ttl=64 time=1.8 ms
64 bytes from 192.168.1.88: icmp_seq=1 ttl=64 time=0.9 ms
2 packets transmitted, 2 packets received, 0% packet loss

Note that you won't be able to reach any other "host" belonging to the two networks, because the packets are discarded by your computer after the address has been modified and the packet has been received. For example, a packet aimed at 192.168.0.32 will leave through sn0 and reappear at sn1 with a destination address of 192.168.1.32, which is not a local address for the host computer.

The Physical Transport of Packets

As far as data transport is concerned, the snull interfaces belong to the Ethernet class.

snull emulates Ethernet because the vast majority of existing networks—at least the segments that a workstation connects to—are based on Ethernet technology, be it 10base-T, 100base-T, or Gigabit. Additionally, the kernel offers some generalized support for Ethernet devices, and there's no reason not to use it. The advantage of being an Ethernet device is so strong that even the plip interface (the interface that uses the printer ports) declares itself as an Ethernet device.

The last advantage of using the Ethernet setup for snull is that you can run tcpdump on the interface to see the packets go by. Watching the interfaces with tcpdump can be a useful way to see how the two interfaces work.

As was mentioned previously, snull works only with IP packets. This limitation is a result of the fact that snull snoops in the packets and even modifies them, in order for the code to work. The code modifies the source, destination, and checksum in the IP header of each packet without checking whether it actually conveys IP information. This quick-and-dirty data modification destroys non-IP packets. If you want to deliver other protocols through snull, you must modify the module's source code.

Connecting to the Kernel

We start looking at the structure of network drivers by dissecting the snull source. Keeping the source code for several drivers handy might help you follow the discussion and see how real-world Linux network drivers operate. As a place to start, we suggest loopback.c, plip.c, and e100.c, in order of increasing complexity. All these files live in drivers/net, within the kernel source tree.

Device Registration

When a driver module is loaded into a running kernel, it requests resources and offers facilities; there's nothing new in that. And there's also nothing new in the way resources are requested. The driver should probe for its device and its hardware location (I/O ports and IRQ line)—but not register them—as described in Section 10.2. The way a network driver is registered by its module initialization function is different from char and block drivers. Since there is no equivalent of major and minor numbers for network interfaces, a network driver does not request such a number. Instead, the driver inserts a data structure for each newly detected interface into a global list of network devices.

Each interface is described by a struct net_device item, which is defined in <linux/netdevice.h>. The snull driver keeps pointers to two of these structures (for sn0 and sn1) in a simple array:

struct net_device *snull_devs[2];

The net_device structure, like many other kernel structures, contains a kobject and is, therefore, reference-counted and exported via sysfs. As with other such structures, it must be allocated dynamically. The kernel function provided to perform this allocation is alloc_netdev, which has the following prototype:

struct net_device *alloc_netdev(int sizeof_priv, 
                                const char *name,
                                void (*setup)(struct net_device *));

Here, sizeof_priv is the size of the driver's "private data" area; with network devices, that area is allocated along with the net_device structure. In fact, the two are allocated together in one large chunk of memory, but driver authors should pretend that they don't know that. name is the name of this interface, as is seen by user space; this name can have a printf-style %d in it. The kernel replaces the %d with the next available interface number. Finally, setup is a pointer to an initialization function that is called to set up the rest of the net_device structure. We get to the initialization function shortly, but, for now, suffice it to say that snull allocates its two device structures in this way:

snull_devs[0] = alloc_netdev(sizeof(struct snull_priv), "sn%d",
        snull_init);
snull_devs[1] = alloc_netdev(sizeof(struct snull_priv), "sn%d",
        snull_init);
if (snull_devs[0] == NULL || snull_devs[1] == NULL)
    goto out;

As always, we must check the return value to ensure that the allocation succeeded.

The networking subsystem provides a number of helper functions wrapped around alloc_netdev for various types of interfaces. The most common is alloc_etherdev, which is defined in <linux/etherdevice.h>:

struct net_device *alloc_etherdev(int sizeof_priv);

This function allocates a network device using eth%d for the name argument. It provides its own initialization function (ether_setup) that sets several net_device fields with appropriate values for Ethernet devices. Thus, there is no driver-supplied initialization function for alloc_etherdev; the driver should simply do its required initialization directly after a successful allocation. Writers of drivers for other types of devices may want to take advantage of one of the other helper functions, such as alloc_fcdev (defined in <linux/fcdevice.h>) for fiber-channel devices, alloc_fddidev (<linux/fddidevice.h>) for FDDI devices, or alloc_trdev (<linux/trdevice.h>) for token ring devices.

snull could use alloc_etherdev without trouble; we chose to use alloc_netdev instead, as a way of demonstrating the lower-level interface and to give us control over the name assigned to the interface.

Once the net_device structure has been initialized, completing the process is just a matter of passing the structure to register_netdev. In snull, the call looks as follows:

for (i = 0; i < 2; i++)
    if ((result = register_netdev(snull_devs[i])))
        printk("snull: error %i registering device \"%s\"\n",
                result, snull_devs[i]->name);

The usual cautions apply here: as soon as you call register_netdev, your driver may be called to operate on the device. Thus, you should not register the device until everything has been completely initialized.

Initializing Each Device

We have looked at the allocation and registration of net_device structures, but we passed over the intermediate step of completely initializing that structure. Note that struct net_device is always put together at runtime; it cannot be set up at compile time in the same manner as a file_operations or block_device_operations structure. This initialization must be complete before calling register_netdev. The net_device structure is large and complicated; fortunately, the kernel takes care of some Ethernet-wide defaults through the ether_setup function (which is called by alloc_etherdev).

Since snull uses alloc_netdev , it has a separate initialization function. The core of this function (snull_init) is as follows:

ether_setup(dev); /* assign some of the fields */

dev->open            = snull_open;
dev->stop            = snull_release;
dev->set_config      = snull_config;
dev->hard_start_xmit = snull_tx;
dev->do_ioctl        = snull_ioctl;
dev->get_stats       = snull_stats;
dev->rebuild_header  = snull_rebuild_header;
dev->hard_header     = snull_header;
dev->tx_timeout      = snull_tx_timeout;
dev->watchdog_timeo = timeout;
/* keep the default flags, just add NOARP */
dev->flags           |= IFF_NOARP;
dev->features        |= NETIF_F_NO_CSUM;
dev->hard_header_cache = NULL;      /* Disable caching */

The above code is a fairly routine initialization of the net_device structure; it is mostly a matter of storing pointers to our various driver functions. The single unusual feature of the code is setting IFF_NOARP in the flags. This specifies that the interface cannot use the Address Resolution Protocol (ARP). ARP is a low-level Ethernet protocol; its job is to turn IP addresses into Ethernet medium access control (MAC) addresses. Since the "remote" systems simulated by snull do not really exist, there is nobody available to answer ARP requests for them. Rather than complicate snull with the addition of an ARP implementation, we chose to mark the interface as being unable to handle that protocol. The assignment to hard_header_cache is there for a similar reason: it disables the caching of the (nonexistent) ARP replies on this interface. This topic is discussed in detail in Section 17.11 later in this chapter.

The initialization code also sets a couple of fields (tx_timeout and watchdog_timeo) that relate to the handling of transmission timeouts. We cover this topic thoroughly in the section Section 17.5.2.

We look now at one more struct net_device field, priv. Its role is similar to that of the private_data pointer that we used for char drivers. Unlike fops->private_data, this priv pointer is allocated along with the net_device structure. Direct access to the priv field is also discouraged, for performance and flexibility reasons. When a driver needs to get access to the private data pointer, it should use the netdev_priv function. Thus, the snull driver is full of declarations such as:

struct snull_priv *priv = netdev_priv(dev);

The snull module declares a snull_priv data structure to be used for priv:

struct snull_priv {
    struct net_device_stats stats;
    int status;
    struct snull_packet *ppool;
    struct snull_packet *rx_queue;  /* List of incoming packets */
    int rx_int_enabled;
    int tx_packetlen;
    u8 *tx_packetdata;
    struct sk_buff *skb;
    spinlock_t lock;
};

The structure includes, among other things, an instance of struct net_device_stats, which is the standard place to hold interface statistics. The following lines in snull_init allocate and initialize dev->priv:

priv = netdev_priv(dev);
memset(priv, 0, sizeof(struct snull_priv));
spin_lock_init(&priv->lock);
snull_rx_ints(dev, 1);      /* enable receive interrupts */

Module Unloading

Nothing special happens when the module is unloaded. The module cleanup function simply unregisters the interfaces, performs whatever internal cleanup is required, and releases the net_device structure back to the system:

void snull_cleanup(void)
{
    int i;
    
    for (i = 0; i < 2;  i++) {
        if (snull_devs[i]) {
            unregister_netdev(snull_devs[i]);
            snull_teardown_pool(snull_devs[i]);
            free_netdev(snull_devs[i]);
        }
    }
    return;
}

The call to unregister_netdev removes the interface from the system; free_netdev returns the net_device structure to the kernel. If a reference to that structure exists somewhere, it may continue to exist, but your driver need not care about that. Once you have unregistered the interface, the kernel no longer calls its methods.

Note that our internal cleanup (done in snull_teardown_pool) cannot happen until the device has been unregistered. It must, however, happen before we return the net_device structure to the system; once we have called free_netdev , we cannot make any further references to the device or our private area.

The net_device Structure in Detail

The net_device structure is at the very core of the network driver layer and deserves a complete description. This list describes all the fields, but more to provide a reference than to be memorized. The rest of this chapter briefly describes each field as soon as it is used in the sample code, so you don't need to keep referring back to this section.

全局信息

Global Information

第一部分struct net_device由以下字段组成:

The first part of struct net_device is composed of the following fields:

char name[IFNAMSIZ];
char name[IFNAMSIZ];

设备的名称。如果驱动程序设置的名称包含%d格式字符串,则register_netdev会将其替换为一个数字以形成唯一的名称;分配的编号从 0 开始。

The name of the device. If the name set by the driver contains a %d format string, register_netdev replaces it with a number to make a unique name; assigned numbers start at 0.

unsigned long state;
unsigned long state;

设备 状态。该字段包括几个标志。驱动程序通常不会直接操作这些标志;相反,提供了一组实用函数。当我们进入驱动程序操作时,我们将很快讨论这些功能。

Device state. The field includes several flags. Drivers do not normally manipulate these flags directly; instead, a set of utility functions has been provided. These functions are discussed shortly when we get into driver operations.

struct net_device *next;
struct net_device *next;

指向全局链表中下一个设备的指针。驱动程序不应触及该字段。

Pointer to the next device in the global linked list. This field shouldn't be touched by the driver.

int (*init)(struct net_device *dev);
int (*init)(struct net_device *dev);

初始化函数。如果设置了该指针,register_netdev会调用该函数来完成net_device结构的初始化。大多数现代网络驱动程序不再使用此函数;相反,初始化在注册接口之前执行。

An initialization function. If this pointer is set, the function is called by register_netdev to complete the initialization of the net_device structure. Most modern network drivers do not use this function any longer; instead, initialization is performed before registering the interface.

硬件信息

Hardware Information

以下字段包含相对简单设备的低级硬件信息。它们是早期 Linux 网络的遗留物;大多数现代驱动程序并不使用它们(可能的例外是if_port)。为了完整起见,我们在这里列出它们。

The following fields contain low-level hardware information for relatively simple devices. They are a holdover from the earlier days of Linux networking; most modern drivers do not make use of them (with the possible exception of if_port). We list them here for completeness.

unsigned long rmem_end;

unsigned long rmem_start;

unsigned long mem_end;

unsigned long mem_start;
unsigned long rmem_end;

unsigned long rmem_start;

unsigned long mem_end;

unsigned long mem_start;

设备内存信息。这些字段保存设备使用的共享内存的起始和结束地址。如果设备具有不同的接收和发送存储器,则mem字段用于发送存储器,rmem字段用于接收存储器。rmem字段永远不会在驱动程序本身之外被引用。按照惯例,end字段的设置使得end - start等于可用板载内存的数量。

Device memory information. These fields hold the beginning and ending addresses of the shared memory used by the device. If the device has different receive and transmit memories, the mem fields are used for transmit memory and the rmem fields for receive memory. The rmem fields are never referenced outside of the driver itself. By convention, the end fields are set so that end - start is the amount of available onboard memory.

unsigned long base_addr;
unsigned long base_addr;

网络接口的 I/O 基地址。与前面的字段一样,该字段由驱动程序在设备探测期间分配。ifconfig命令可用于显示或修改当前值。base_addr可以在系统引导时通过内核命令行(netdev=参数)显式分配,也可以在模块加载时分配。该字段与上述内存字段一样,不被内核使用。

The I/O base address of the network interface. This field, like the previous ones, is assigned by the driver during the device probe. The ifconfig command can be used to display or modify the current value. The base_addr can be explicitly assigned on the kernel command line at system boot (via the netdev= parameter) or at module load time. The field, like the memory fields described above, is not used by the kernel.

unsigned char irq;
unsigned char irq;

分配的中断号。列出接口时,ifconfig会打印dev->irq的值。该值通常可以在启动或加载时设置,并在以后使用ifconfig进行修改。

The assigned interrupt number. The value of dev->irq is printed by ifconfig when interfaces are listed. This value can usually be set at boot or load time and modified later using ifconfig.

unsigned char if_port;
unsigned char if_port;

多端口设备上使用的端口。例如,该字段用于支持同轴 ( IF_PORT_10BASE2) 和双绞线 ( IF_PORT_100BASET) 以太网连接的设备。全套已知端口类型在<linux/netdevice.h>中定义。

The port in use on multiport devices. This field is used, for example, with devices that support both coaxial (IF_PORT_10BASE2) and twisted-pair (IF_PORT_100BASET) Ethernet connections. The full set of known port types is defined in <linux/netdevice.h>.

unsigned char dma;
unsigned char dma;

设备分配的 DMA 通道。该字段仅对某些外设总线(例如 ISA)有意义。它不在设备驱动程序本身之外使用,仅用于提供信息(在ifconfig中显示)。

The DMA channel allocated by the device. The field makes sense only with some peripheral buses, such as ISA. It is not used outside of the device driver itself but for informational purposes (in ifconfig).

接口信息

Interface Information

有关接口的大部分信息均由ether_setup函数(或任何其他适合给定硬件类型的设置函数)正确设置。以太网卡可以依赖此通用函数来处理大多数这些字段,但flags和dev_addr字段是特定于设备的,必须在初始化时显式分配。

Most of the information about the interface is correctly set up by the ether_setup function (or whatever other setup function is appropriate for the given hardware type). Ethernet cards can rely on this general-purpose function for most of these fields, but the flags and dev_addr fields are device specific and must be explicitly assigned at initialization time.

一些 非以太网接口可以使用类似于ether_setup 的辅助函数。drivers/net/net_init.c导出许多此类函数,包括以下内容:

Some non-Ethernet interfaces can use helper functions similar to ether_setup. drivers/net/net_init.c exports a number of such functions, including the following:

void ltalk_setup(struct net_device *dev);
void ltalk_setup(struct net_device *dev);

设置 LocalTalk 设备的字段

Sets up the fields for a LocalTalk device

void fc_setup(struct net_device *dev);
void fc_setup(struct net_device *dev);

初始化光纤通道设备的字段

Initializes fields for fiber-channel devices

void fddi_setup(struct net_device *dev);
void fddi_setup(struct net_device *dev);

配置光纤分布式数据接口 (FDDI) 网络的接口

Configures an interface for a Fiber Distributed Data Interface (FDDI) network

void hippi_setup(struct net_device *dev);
void hippi_setup(struct net_device *dev);

为高性能并行接口 (HIPPI) 高速互连驱动程序准备字段

Prepares fields for a High-Performance Parallel Interface (HIPPI) high-speed interconnect driver

void tr_setup(struct net_device *dev);
void tr_setup(struct net_device *dev);

处理令牌环网络接口的设置

Handles setup for token ring network interfaces

大多数设备都属于这些类别之一。但是,如果您的项目是全新且不同的,则需要手动分配以下字段:

Most devices are covered by one of these classes. If yours is something radically new and different, however, you need to assign the following fields by hand:

unsigned short hard_header_len;
unsigned short hard_header_len;

硬件标头长度,即在 IP 标头或其他协议信息之前引导传输数据包的八位字节数。对于以太网接口,hard_header_len的值为 14(ETH_HLEN)。

The hardware header length, that is, the number of octets that lead the transmitted packet before the IP header, or other protocol information. The value of hard_header_len is 14 (ETH_HLEN) for Ethernet interfaces.

unsigned mtu;
unsigned mtu;

最大传输单元 (MTU)。该字段被网络层用来驱动数据包传输。以太网的 MTU 为 1500 个八位位组 ( ETH_DATA_LEN)。可以使用ifconfig更改该值 。

The maximum transfer unit (MTU). This field is used by the network layer to drive packet transmission. Ethernet has an MTU of 1500 octets (ETH_DATA_LEN). This value can be changed with ifconfig.

unsigned long tx_queue_len;
unsigned long tx_queue_len;

可以在设备传输队列中排队的最大帧数。该值由ether_setup设置为 1000 ,但您可以更改它。例如,plip使用 10 以避免浪费系统内存(plip的吞吐量比真正的以太网接口低)。

The maximum number of frames that can be queued on the device's transmission queue. This value is set to 1000 by ether_setup, but you can change it. For example, plip uses 10 to avoid wasting system memory (plip has a lower throughput than a real Ethernet interface).

unsigned short type;
unsigned short type;

接口的硬件类型。ARP 使用type字段来确定接口支持哪种硬件地址。以太网接口的正确值为ARPHRD_ETHER,这也是ether_setup设置的值。已识别的类型在<linux/if_arp.h>中定义。

The hardware type of the interface. The type field is used by ARP to determine what kind of hardware address the interface supports. The proper value for Ethernet interfaces is ARPHRD_ETHER, and that is the value set by ether_setup. The recognized types are defined in <linux/if_arp.h>.

unsigned char addr_len;

unsigned char broadcast[MAX_ADDR_LEN];

unsigned char dev_addr[MAX_ADDR_LEN];
unsigned char addr_len;

unsigned char broadcast[MAX_ADDR_LEN];

unsigned char dev_addr[MAX_ADDR_LEN];

硬件 (MAC) 地址长度和设备硬件地址。以太网地址长度为六个八位字节(我们指的是接口板的硬件 ID),广播地址由六个0xff八位字节组成;ether_setup会把这些值设置正确。另一方面,设备地址必须以设备特定的方式从接口板读取,驱动程序应将其复制到dev_addr中。硬件地址用于在数据包移交给驱动程序进行传输之前生成正确的以太网标头。snull设备不使用物理接口,而是发明了自己的硬件地址。

Hardware (MAC) address length and device hardware addresses. The Ethernet address length is six octets (we are referring to the hardware ID of the interface board), and the broadcast address is made up of six 0xff octets; ether_setup arranges for these values to be correct. The device address, on the other hand, must be read from the interface board in a device-specific way, and the driver should copy it to dev_addr. The hardware address is used to generate correct Ethernet headers before the packet is handed over to the driver for transmission. The snull device doesn't use a physical interface, and it invents its own hardware address.

unsigned short flags;

int features;
unsigned short flags;

int features;

接口标志(下面详细介绍)。

Interface flags (detailed next).

flags字段是包含以下位值的位掩码。前缀IFF_ 代表“接口标志”。有些标志由内核管理,有些标志由接口在初始化时设置,以断言接口的各种功能和其他特性。<linux/if.h>中定义的有效标志 是:

The flags field is a bit mask including the following bit values. The IFF_ prefix stands for "interface flags." Some flags are managed by the kernel, and some are set by the interface at initialization time to assert various capabilities and other features of the interface. The valid flags, which are defined in <linux/if.h>, are:

IFF_UP
IFF_UP

该标志对于驱动程序来说是只读的。当接口处于活动状态并准备好传输数据包时,内核将其打开。

This flag is read-only for the driver. The kernel turns it on when the interface is active and ready to transfer packets.

IFF_BROADCAST
IFF_BROADCAST

该标志(由网络代码维护)表明该接口允许广播。以太网板可以。

This flag (maintained by the networking code) states that the interface allows broadcasting. Ethernet boards do.

IFF_DEBUG
IFF_DEBUG

这标志着调试模式。该标志可用于控制 printk调用的详细程度或用于其他调试目的。尽管当前没有树内驱动程序使用此标志,但用户程序可以通过ioctl设置和重置它,并且您的驱动程序可以使用它。Misc-progs/netifdebug程序可用于打开和关闭该标志。

This marks debug mode. The flag can be used to control the verbosity of your printk calls or for other debugging purposes. Although no in-tree driver currently uses this flag, it can be set and reset by user programs via ioctl, and your driver can use it. The misc-progs/netifdebug program can be used to turn the flag on and off.

IFF_LOOPBACK
IFF_LOOPBACK

该标志只应在环回接口中设置。内核会检查IFF_LOOPBACK标志,而不是将lo这个名称硬编码为特殊接口。

This flag should be set only in the loopback interface. The kernel checks for IFF_LOOPBACK instead of hardwiring the lo name as a special interface.

IFF_POINTOPOINT
IFF_POINTOPOINT

该标志表明接口已连接到点对点链路。它由驱动程序设置,有时由ifconfig设置。例如, plip和 PPP 驱动程序已对其进行设置。

This flag signals that the interface is connected to a point-to-point link. It is set by the driver or, sometimes, by ifconfig. For example, plip and the PPP driver have it set.

IFF_NOARP
IFF_NOARP

这意味着该接口不能执行ARP。例如,点对点接口不需要运行 ARP,这只会增加额外的流量,而不会检索有用的信息。snull 运行时没有 ARP 功能,因此它设置了该标志。

This means that the interface can't perform ARP. For example, point-to-point interfaces don't need to run ARP, which would only impose additional traffic without retrieving useful information. snull runs without ARP capabilities, so it sets the flag.

IFF_PROMISC
IFF_PROMISC

该标志被设置(由网络代码)以激活混杂操作。默认情况下,以太网接口使用硬件过滤器来确保它们仅接收广播数据包和定向到该接口硬件地址的数据包。数据包嗅探器(例如tcpdump)在接口上设置混杂模式,以便检索在接口传输介质上传输的所有数据包。

This flag is set (by the networking code) to activate promiscuous operation. By default, Ethernet interfaces use a hardware filter to ensure that they receive broadcast packets and packets directed to that interface's hardware address only. Packet sniffers such as tcpdump set promiscuous mode on the interface in order to retrieve all packets that travel on the interface's transmission medium.

IFF_MULTICAST
IFF_MULTICAST

该标志由驱动程序设置,用于标记能够进行多播传输的接口。ether_setup默认设置IFF_MULTICAST,因此如果您的驱动程序不支持多播,则必须在初始化时清除该标志。

This flag is set by drivers to mark interfaces that are capable of multicast transmission. ether_setup sets IFF_MULTICAST by default, so if your driver does not support multicast, it must clear the flag at initialization time.

IFF_ALLMULTI
IFF_ALLMULTI

该标志告诉接口接收所有多播数据包。仅当设置了IFF_MULTICAST时,内核才会在主机执行多播路由时设置它。IFF_ALLMULTI对于驱动程序来说是只读的。本章后面的 17.14 节中会用到多播标志。

This flag tells the interface to receive all multicast packets. The kernel sets it when the host performs multicast routing, only if IFF_MULTICAST is set. IFF_ALLMULTI is read-only for the driver. Multicast flags are used in Section 17.14 later in this chapter.

IFF_MASTER

IFF_SLAVE
IFF_MASTER

IFF_SLAVE

这些标志由负载均衡代码使用。接口驱动程序不需要了解它们。

These flags are used by the load equalization code. The interface driver doesn't need to know about them.

IFF_PORTSEL

IFF_AUTOMEDIA
IFF_PORTSEL

IFF_AUTOMEDIA

这些标志表明设备能够在多种媒体类型之间切换;例如, 非屏蔽双绞线 (UTP) 与同轴以太网电缆。如果IFF_AUTOMEDIA 设置,设备会自动选择正确的介质。实际上,内核不使用任何一个标志。

These flags signal that the device is capable of switching between multiple media types; for example, unshielded twisted pair (UTP) versus coaxial Ethernet cables. If IFF_AUTOMEDIA is set, the device selects the proper medium automatically. In practice, the kernel makes no use of either flag.

IFF_DYNAMIC
IFF_DYNAMIC

该标志由驱动程序设置,表明该接口的地址可以更改。它当前未被内核使用。

This flag, set by the driver, indicates that the address of this interface can change. It is not currently used by the kernel.

IFF_RUNNING
IFF_RUNNING

该标志表明接口已启动并正在运行。它主要是为了 BSD 兼容性而存在的;内核很少使用它。大多数网络驱动程序无需担心IFF_RUNNING

This flag indicates that the interface is up and running. It is mostly present for BSD compatibility; the kernel makes little use of it. Most network drivers need not worry about IFF_RUNNING.

IFF_NOTRAILERS
IFF_NOTRAILERS

该标志在 Linux 中未使用,但存在是为了兼容 BSD。

This flag is unused in Linux, but it exists for BSD compatibility.

当程序更改IFF_UP时,将调用open或stop设备方法。此外,当IFF_UP或任何其他标志被修改时,set_multicast_list方法会被调用。如果驱动程序需要针对标志的修改执行某些操作,则必须在set_multicast_list中执行该操作。例如,当IFF_PROMISC被设置或清除时,set_multicast_list必须通知板载硬件过滤器。第 17.14 节概述了该设备方法的职责。

When a program changes IFF_UP, the open or stop device method is called. Furthermore, when IFF_UP or any other flag is modified, the set_multicast_list method is invoked. If the driver needs to perform some action in response to a modification of the flags, it must take that action in set_multicast_list. For example, when IFF_PROMISC is set or reset, set_multicast_list must notify the onboard hardware filter. The responsibilities of this device method are outlined in Section 17.14.

net_device结构的features字段由驱动程序设置,以告诉内核该接口具有的任何特殊硬件功能。我们将讨论其中一些功能;其他的超出了本书的范围。完整的集合是:

The features field of the net_device structure is set by the driver to tell the kernel about any special hardware capabilities that this interface has. We will discuss some of these features; others are beyond the scope of this book. The full set is:

NETIF_F_SG

NETIF_F_FRAGLIST
NETIF_F_SG

NETIF_F_FRAGLIST

这两个标志都控制分散/聚集 I/O 的使用。如果您的接口可以传输已分割成多个不同内存段的数据包,则应设置NETIF_F_SG当然,您必须实际实现分散/聚集 I/O(我们在第 17.5.3 节中描述了如何完成)。NETIF_F_FRAGLIST说明您的接口可以处理已分片的数据包;在 2.6 中只有环回驱动程序执行此操作。

Both of these flags control the use of scatter/gather I/O. If your interface can transmit a packet that has been split into several distinct memory segments, you should set NETIF_F_SG. Of course, you have to actually implement the scatter/gather I/O (we describe how that is done in the Section 17.5.3). NETIF_F_FRAGLIST states that your interface can cope with packets that have been fragmented; only the loopback driver does this in 2.6.

请注意,如果内核不提供某种形式的校验和,则它不会对您的设备执行分散/聚集 I/O。原因是,如果内核必须传递碎片(“非线性”)数据包来计算校验和,它可能会同时复制数据并合并数据包。

Note that the kernel does not perform scatter/gather I/O to your device if it does not also provide some form of checksumming as well. The reason is that, if the kernel has to make a pass over a fragmented ("nonlinear") packet to calculate the checksum, it might as well copy the data and coalesce the packet at the same time.

NETIF_F_IP_CSUM

NETIF_F_NO_CSUM

NETIF_F_HW_CSUM
NETIF_F_IP_CSUM

NETIF_F_NO_CSUM

NETIF_F_HW_CSUM

这些标志都是告诉内核无需对经由此接口离开系统的部分或全部数据包应用校验和的方式。如果您的接口可以校验 IP 数据包但不能校验其他数据包,请设置NETIF_F_IP_CSUM。如果此接口从不需要校验和,请设置NETIF_F_NO_CSUM。环回驱动程序设置此标志,snull也是如此;由于数据包仅通过系统内存传输,因此(希望如此!)它们不会被损坏,也无需检查它们。如果您的硬件自己进行校验和,请设置NETIF_F_HW_CSUM。

These flags are all ways of telling the kernel that it need not apply checksums to some or all packets leaving the system by this interface. Set NETIF_F_IP_CSUM if your interface can checksum IP packets but not others. If no checksums are ever required for this interface, set NETIF_F_NO_CSUM. The loopback driver sets this flag, and snull does, too; since packets are only transferred through system memory, there is (one hopes!) no opportunity for them to be corrupted, and no need to check them. If your hardware does checksumming itself, set NETIF_F_HW_CSUM.

NETIF_F_HIGHDMA
NETIF_F_HIGHDMA

如果您的设备可以对高端内存执行 DMA,请设置此标志。如果没有此标志,提供给驱动程序的所有数据包缓冲区都将分配在低内存中。

Set this flag if your device can perform DMA to high memory. In the absence of this flag, all packet buffers provided to your driver are allocated in low memory.

NETIF_F_HW_VLAN_TX

NETIF_F_HW_VLAN_RX

NETIF_F_HW_VLAN_FILTER

NETIF_F_VLAN_CHALLENGED
NETIF_F_HW_VLAN_TX

NETIF_F_HW_VLAN_RX

NETIF_F_HW_VLAN_FILTER

NETIF_F_VLAN_CHALLENGED

这些选项描述您的硬件对 802.1q VLAN 数据包的支持。VLAN 支持超出了我们在本章中所能讨论的范围。如果 VLAN 数据包使您的设备感到困惑(实际上不应该如此),请设置该NETIF_F_VLAN_CHALLENGED标志。

These options describe your hardware's support for 802.1q VLAN packets. VLAN support is beyond what we can cover in this chapter. If VLAN packets confuse your device (which they really shouldn't), set the NETIF_F_VLAN_CHALLENGED flag.

NETIF_F_TSO
NETIF_F_TSO

如果您的设备可以执行 TCP 分段卸载,请设置此标志。TSO 是一项高级功能,我们在此无法介绍。

Set this flag if your device can perform TCP segmentation offloading. TSO is an advanced feature that we cannot cover here.

设备方法

The Device Methods

与字符和块驱动程序一样,每个网络设备都声明作用于它的函数。本节列出了可以在网络接口上执行的操作。有些操作可以保留为NULL,而其他操作通常不需要改动,因为ether_setup会为它们分配合适的方法。

As happens with the char and block drivers, each network device declares the functions that act on it. Operations that can be performed on network interfaces are listed in this section. Some of the operations can be left NULL, and others are usually untouched because ether_setup assigns suitable methods to them.

网络接口的设备方法可分为两组:基本方法和可选方法。基本方法包括那些能够使用接口所需的方法;可选方法实现不严格要求的更高级功能。以下是基本方法:

Device methods for a network interface can be divided into two groups: fundamental and optional. Fundamental methods include those that are needed to be able to use the interface; optional methods implement more advanced functionalities that are not strictly required. The following are the fundamental methods:

int (*open)(struct net_device *dev);
int (*open)(struct net_device *dev);

打开接口。只要ifconfig激活该接口,它就会被打开。open方法应该注册它需要的任何系统资源(I/O 端口、IRQ、DMA 等),打开硬件,并执行设备所需的任何其他设置。

Opens the interface. The interface is opened whenever ifconfig activates it. The open method should register any system resource it needs (I/O ports, IRQ, DMA, etc.), turn on the hardware, and perform any other setup your device requires.

int (*stop)(struct net_device *dev);
int (*stop)(struct net_device *dev);

停止接口。当接口被关闭时,接口就会停止。此功能应反转在打开时执行的操作。

Stops the interface. The interface is stopped when it is brought down. This function should reverse operations performed at open time.

int (*hard_start_xmit) (struct sk_buff *skb, struct net_device *dev);
int (*hard_start_xmit) (struct sk_buff *skb, struct net_device *dev);

启动数据包传输的方法。完整的数据包(协议头和所有)包含在套接字缓冲区 ( sk_buff) 结构中。本章稍后将介绍套接字缓冲区。

Method that initiates the transmission of a packet. The full packet (protocol headers and all) is contained in a socket buffer (sk_buff) structure. Socket buffers are introduced later in this chapter.

int (*hard_header) (struct sk_buff *skb, struct net_device *dev, unsigned

short type, void *daddr, void *saddr, unsigned len);
int (*hard_header) (struct sk_buff *skb, struct net_device *dev, unsigned

short type, void *daddr, void *saddr, unsigned len);

该函数(在hard_start_xmit之前调用)根据之前检索到的源和目标硬件地址构建硬件标头;它的工作是将作为参数传递给它的信息组织成适当的、特定于设备的硬件标头。eth_header是类以太网接口的默认函数,ether_setup会相应地分配该字段。

Function (called before hard_start_xmit) that builds the hardware header from the source and destination hardware addresses that were previously retrieved; its job is to organize the information passed to it as arguments into an appropriate, device-specific hardware header. eth_header is the default function for Ethernet-like interfaces, and ether_setup assigns this field accordingly.

int (*rebuild_header)(struct sk_buff *skb);
int (*rebuild_header)(struct sk_buff *skb);

用于在 ARP 解析完成后、数据包传输之前重建硬件标头的函数。以太网设备使用的默认功能是使用 ARP 支持代码来填充数据包中缺失的信息。

Function used to rebuild the hardware header after ARP resolution completes but before a packet is transmitted. The default function used by Ethernet devices uses the ARP support code to fill the packet with missing information.

void (*tx_timeout)(struct net_device *dev);
void (*tx_timeout)(struct net_device *dev);

当数据包传输未能在合理的时间内完成时,网络代码调用的方法(假设已错过中断或接口已锁定)。它应该处理问题并恢复数据包传输。

Method called by the networking code when a packet transmission fails to complete within a reasonable period, on the assumption that an interrupt has been missed or the interface has locked up. It should handle the problem and resume packet transmission.

struct net_device_stats *(*get_stats)(struct net_device *dev);
struct net_device_stats *(*get_stats)(struct net_device *dev);

每当应用程序需要获取接口的统计信息时,就会调用此方法。例如,当 运行ifconfignetstat -i时,就会发生这种情况。第 17.13 节介绍了snull的示例实现。

Whenever an application needs to get statistics for the interface, this method is called. This happens, for example, when ifconfig or netstat -i is run. A sample implementation for snull is introduced in Section 17.13.

int (*set_config)(struct net_device *dev, struct ifmap *map);
int (*set_config)(struct net_device *dev, struct ifmap *map);

更改接口配置。该方法是配置驱动程序的入口点。设备的 I/O 地址及其中断号可以在运行时使用set_config进行更改。如果无法探测到接口,系统管理员可以使用此功能。现代硬件的驱动程序通常不需要实现此方法。

Changes the interface configuration. This method is the entry point for configuring the driver. The I/O address for the device and its interrupt number can be changed at runtime using set_config. This capability can be used by the system administrator if the interface cannot be probed for. Drivers for modern hardware normally do not need to implement this method.

剩余设备 操作是可选的:

The remaining device operations are optional:

int weight;

int (*poll)(struct net_device *dev, int *quota);
int weight;

int (*poll)(struct net_device *dev, int *quota);

由符合 NAPI 的驱动程序提供的方法,用于在禁用中断的情况下以轮询模式操作接口。NAPI(以及weight字段)在第 17.8 节中介绍。

Method provided by NAPI-compliant drivers to operate the interface in a polled mode, with interrupts disabled. NAPI (and the weight field) are covered in Section 17.8.

void (*poll_controller)(struct net_device *dev);
void (*poll_controller)(struct net_device *dev);

在中断被禁用的情况下,要求驱动程序检查接口上的事件的功能。它用于特定的内核内网络任务,例如远程控制台和通过网络进行内核调试。

Function that asks the driver to check for events on the interface in situations where interrupts are disabled. It is used for specific in-kernel networking tasks, such as remote consoles and kernel debugging over the network.

int (*do_ioctl)(struct net_device *dev, struct ifreq *ifr, int cmd);
int (*do_ioctl)(struct net_device *dev, struct ifreq *ifr, int cmd);

执行特定于接口的ioctl命令。(这些命令的实现在第 17.12 节中描述。)如果接口不需要任何特定于接口的命令,struct net_device中的相应字段可以保留为NULL。

Performs interface-specific ioctl commands. (Implementation of those commands is described in Section 17.12.) The corresponding field in struct net_device can be left as NULL if the interface doesn't need any interface-specific commands.

void (*set_multicast_list)(struct net_device *dev);
void (*set_multicast_list)(struct net_device *dev);

当设备的多播列表更改以及标志更改时调用的方法。有关更多详细信息和示例实现,请参阅第 17.14 节。

Method called when the multicast list for the device changes and when the flags change. See the Section 17.14 for further details and a sample implementation.

int (*set_mac_address)(struct net_device *dev, void *addr);
int (*set_mac_address)(struct net_device *dev, void *addr);

如果接口支持更改其硬件地址的能力,则可以实现该函数。许多接口根本不支持此功能。其他接口使用默认的eth_mac_addr实现(来自drivers/net/net_init.c)。eth_mac_addr仅将新地址复制到dev->dev_addr中,并且仅当接口未运行时才会这样做。使用eth_mac_addr的驱动程序应在其open方法中根据dev->dev_addr设置硬件 MAC 地址。

Function that can be implemented if the interface supports the ability to change its hardware address. Many interfaces don't support this ability at all. Others use the default eth_mac_addr implementation (from drivers/net/net_init.c). eth_mac_addr only copies the new address into dev->dev_addr, and it does so only if the interface is not running. Drivers that use eth_mac_addr should set the hardware MAC address from dev->dev_addr in their open method.

int (*change_mtu)(struct net_device *dev, int new_mtu);
int (*change_mtu)(struct net_device *dev, int new_mtu);

当接口的最大传输单元 (MTU) 发生变化时执行操作的函数。如果驱动程序需要在用户更改 MTU 时执行任何特定操作,则应声明自己的函数;否则,默认值会做正确的事情。 如果您有兴趣, snull有一个该函数的模板。

Function that takes action if there is a change in the maximum transfer unit (MTU) for the interface. If the driver needs to do anything particular when the MTU is changed by the user, it should declare its own function; otherwise, the default does the right thing. snull has a template for the function if you are interested.

int (*header_cache) (struct neighbour *neigh, struct hh_cache *hh);
int (*header_cache) (struct neighbour *neigh, struct hh_cache *hh);

header_cache被调用来用 ARP 查询的结果填充hh_cache结构。几乎所有类似以太网的驱动程序都可以使用默认的eth_header_cache实现。

header_cache is called to fill in the hh_cache structure with the results of an ARP query. Almost all Ethernet-like drivers can use the default eth_header_cache implementation.

int (*header_cache_update) (struct hh_cache *hh, struct net_device *dev,

unsigned char *haddr);
int (*header_cache_update) (struct hh_cache *hh, struct net_device *dev,

unsigned char *haddr);

响应变化而更新hh_cache结构中目标地址的方法。以太网设备使用eth_header_cache_update。

Method that updates the destination address in the hh_cache structure in response to a change. Ethernet devices use eth_header_cache_update.

int (*hard_header_parse) (struct sk_buff *skb, unsigned char *haddr);
int (*hard_header_parse) (struct sk_buff *skb, unsigned char *haddr);

hard_header_parse方法从skb中包含的数据包中提取源地址,并将其复制到haddr处的缓冲区中。函数的返回值是该地址的长度。以太网设备通常使用eth_header_parse。

The hard_header_parse method extracts the source address from the packet contained in skb, copying it into the buffer at haddr. The return value from the function is the length of that address. Ethernet devices normally use eth_header_parse.

实用字段

Utility Fields

接口使用struct net_device中剩余的数据字段来保存有用的状态信息。ifconfig和netstat使用其中某些字段向用户提供有关当前配置的信息。因此,接口应该为这些字段赋值:

The remaining struct net_device data fields are used by the interface to hold useful status information. Some of the fields are used by ifconfig and netstat to provide the user with information about the current configuration. Therefore, an interface should assign values to these fields:

unsigned long trans_start;

unsigned long last_rx;
unsigned long trans_start;

unsigned long last_rx;

保存 jiffies 值的字段。驱动程序负责分别在传输开始时和接收到数据包时更新这些值。网络子系统使用trans_start值来检测发送器锁死。last_rx目前未使用,但驱动程序无论如何都应该维护该字段,以备将来使用。

Fields that hold a jiffies value. The driver is responsible for updating these values when transmission begins and when a packet is received, respectively. The trans_start value is used by the networking subsystem to detect transmitter lockups. last_rx is currently unused, but the driver should maintain this field anyway to be prepared for future use.

int watchdog_timeo;
int watchdog_timeo;

在网络层判定已发生传输超时并调用驱动程序的tx_timeout函数之前应经过的最短时间(以 jiffies 为单位) 。

The minimum time (in jiffies) that should pass before the networking layer decides that a transmission timeout has occurred and calls the driver's tx_timeout function.

void *priv;
void *priv;

相当于filp->private_data. 在现代驱动程序中,该字段由alloc_netdev设置,不应直接访问;请改用netdev_priv

The equivalent of filp->private_data. In modern drivers, this field is set by alloc_netdev and should not be accessed directly; use netdev_priv instead.

struct dev_mc_list *mc_list;

int mc_count;
struct dev_mc_list *mc_list;

int mc_count;

处理多播传输的字段。mc_count是 中的项目数mc_list。详细信息请参见第 17.14 节。

Fields that handle multicast transmission. mc_count is the count of items in mc_list. See the Section 17.14 for further details.

spinlock_t xmit_lock;

int xmit_lock_owner;
spinlock_t xmit_lock;

int xmit_lock_owner;

xmit_lock用于避免对驱动程序的hard_start_xmit函数的多个同时调用。xmit_lock_owner是已获取xmit_lock的 CPU 编号。驱动程序不应更改这些字段。

The xmit_lock is used to avoid multiple simultaneous calls to the driver's hard_start_xmit function. xmit_lock_owner is the number of the CPU that has obtained xmit_lock. The driver should make no changes to these fields.

struct net_device中还有其他字段,但网络驱动程序不使用它们。

There are other fields in struct net_device, but they are not used by network drivers.

打开和关闭

Opening and Closing

我们的驱动程序可以在模块加载时或内核启动时探测接口。然而,在接口可以承载数据包之前,内核必须打开它并为其分配地址。内核响应ifconfig命令打开或关闭接口。

Our driver can probe for the interface at module load time or at kernel boot. Before the interface can carry packets, however, the kernel must open it and assign an address to it. The kernel opens or closes an interface in response to the ifconfig command.

当使用ifconfig为接口分配地址时,它执行两个任务。首先,它通过ioctl(SIOCSIFADDR)(Socket I/O Control Set Interface Address)来分配地址。然后,它通过ioctl(SIOCSIFFLAGS)(Socket I/O Control Set Interface Flags)设置dev->flag中的IFF_UP位以打开接口。

When ifconfig is used to assign an address to the interface, it performs two tasks. First, it assigns the address by means of ioctl(SIOCSIFADDR) (Socket I/O Control Set Interface Address). Then it sets the IFF_UP bit in dev->flag by means of ioctl(SIOCSIFFLAGS) (Socket I/O Control Set Interface Flags) to turn the interface on.

就设备而言,ioctl(SIOCSIFADDR)什么也不做。不会调用任何驱动程序函数——该任务是独立于设备的,由内核执行。然而,后一个命令(ioctl(SIOCSIFFLAGS))会调用设备的open方法。

As far as the device is concerned, ioctl(SIOCSIFADDR) does nothing. No driver function is invoked—the task is device independent, and the kernel performs it. The latter command (ioctl(SIOCSIFFLAGS)), however, calls the open method for the device.

同样,当接口关闭时,ifconfig使用ioctl(SIOCSIFFLAGS)清除IFF_UP,并调用stop方法。

Similarly, when the interface is shut down, ifconfig uses ioctl(SIOCSIFFLAGS) to clear IFF_UP, and the stop method is called.

两种设备方法在成功时返回 0,在出错时返回通常的负值。

Both device methods return 0 in case of success and the usual negative value in case of error.

就实际代码而言,驱动程序必须执行许多与字符和块驱动程序相同的任务。open请求它需要的任何系统资源并让接口启动;stop关闭接口并释放系统资源。然而,网络驱动程序必须在打开时执行一些额外的步骤。

As far as the actual code is concerned, the driver has to perform many of the same tasks as the char and block drivers do. open requests any system resources it needs and tells the interface to come up; stop shuts down the interface and releases system resources. Network drivers must perform some additional steps at open time, however.

首先,在接口与外界通信之前,需要将硬件(MAC)地址从硬件设备复制到dev->dev_addr。这可以在打开时完成。snull软件接口在open内部分配它;它只是使用长度为ETH_ALEN(以太网硬件地址的长度)的 ASCII 字符串伪造一个硬件地址。

First, the hardware (MAC) address needs to be copied from the hardware device to dev->dev_addr before the interface can communicate with the outside world. The hardware address can then be copied to the device at open time. The snull software interface assigns it from within open; it just fakes a hardware number using an ASCII string of length ETH_ALEN, the length of Ethernet hardware addresses.

一旦准备好开始发送数据,open方法 还应该启动接口的传输队列(允许它接受数据包进行传输)。内核提供了启动队列的函数:

The open method should also start the interface's transmit queue (allowing it to accept packets for transmission) once it is ready to start sending data. The kernel provides a function to start the queue:

void netif_start_queue(struct net_device *dev);
void netif_start_queue(struct net_device *dev);

snull的open代码如下所示:

The open code for snull looks like the following:

int snull_open(struct net_device *dev)
{
    /* request_region( ), request_irq( ), .... (如 fops->open) */

    /*
     * 分配板子的硬件地址:使用"\0SNULx",其中
     * x 为 0 或 1。第一个字节为 '\0' 以避免成为多播
     * 地址(多播地址的第一个字节是奇数)。
     */
    memcpy(dev->dev_addr, "\0SNUL0", ETH_ALEN);
    if (dev == snull_devs[1])
        dev->dev_addr[ETH_ALEN-1]++; /* \0SNUL1 */
    netif_start_queue(dev);
    return 0;
}
int snull_open(struct net_device *dev)
{
    /* request_region(  ), request_irq(  ), ....  (like fops->open) */

    /* 
     * Assign the hardware address of the board: use "\0SNULx", where
     * x is 0 or 1. The first byte is '\0' to avoid being a multicast
     * address (the first byte of multicast addrs is odd).
     */
    memcpy(dev->dev_addr, "\0SNUL0", ETH_ALEN);
    if (dev =  = snull_devs[1])
        dev->dev_addr[ETH_ALEN-1]++; /* \0SNUL1 */
    netif_start_queue(dev);
    return 0;
}

正如你所看到的,在没有真实硬件的情况下,open 方法几乎没有什么可做的。stop 方法也是如此;它只是反转 open 的操作。因此,实现 stop 的函数通常称为 close 或 release。

As you can see, in the absence of real hardware, there is little to do in the open method. The same is true of the stop method; it just reverses the operations of open. For this reason, the function implementing stop is often called close or release.

int snull_release(struct net_device *dev)
{
    /* release ports, irq and such -- like fops->close */

    netif_stop_queue(dev); /* can't transmit any more */
    return 0;
}

该函数:

The function:

void netif_stop_queue(struct net_device *dev);

与 netif_start_queue 相反;它将设备标记为无法再传输任何数据包。该函数必须在接口关闭时调用(在 stop 方法中),但也可以用来暂时停止传输,如下一节所述。

is the opposite of netif_start_queue; it marks the device as being unable to transmit any more packets. The function must be called when the interface is closed (in the stop method) but can also be used to temporarily stop transmission, as explained in the next section.

数据包传输

Packet Transmission

网络接口执行的最重要的任务是数据传输和接收。我们从传输开始,因为它稍微容易理解一些。

The most important tasks performed by network interfaces are data transmission and reception. We start with transmission because it is slightly easier to understand.

传输是指通过网络链路发送数据包的行为。每当内核需要传输数据包时,它都会调用驱动程序的 hard_start_xmit 方法将数据放入传出队列中。内核处理的每个数据包都包含在套接字缓冲区结构(struct sk_buff)中,其定义可在 <linux/skbuff.h> 中找到。该结构的名称来源于用于表示网络连接的 Unix 抽象,即套接字。即使接口与套接字无关,每个网络数据包都属于较高网络层的某个套接字,并且任何套接字的输入/输出缓冲区都是 struct sk_buff 结构的列表。相同的 sk_buff 结构用于在所有 Linux 网络子系统中承载网络数据,但就接口而言,套接字缓冲区只是一个数据包。

Transmission refers to the act of sending a packet over a network link. Whenever the kernel needs to transmit a data packet, it calls the driver's hard_start_xmit method to put the data on an outgoing queue. Each packet handled by the kernel is contained in a socket buffer structure (struct sk_buff), whose definition is found in <linux/skbuff.h>. The structure gets its name from the Unix abstraction used to represent a network connection, the socket. Even if the interface has nothing to do with sockets, each network packet belongs to a socket in the higher network layers, and the input/output buffers of any socket are lists of struct sk_buff structures. The same sk_buff structure is used to host network data throughout all the Linux network subsystems, but a socket buffer is just a packet as far as the interface is concerned.

指向 sk_buff 的指针通常称为 skb,我们在示例代码和文本中都遵循这种做法。

A pointer to sk_buff is usually called skb, and we follow this practice both in the sample code and in the text.

套接字缓冲区是一个复杂的结构,内核提供了许多对其进行操作的函数。这些函数稍后在第 17.10 节中描述;目前,关于 sk_buff 的一些基本事实足以让我们编写一个可用的驱动程序。

The socket buffer is a complex structure, and the kernel offers a number of functions to act on it. The functions are described later in Section 17.10; for now, a few basic facts about sk_buff are enough for us to write a working driver.

传递给 hard_start_xmit 的套接字缓冲区包含应出现在媒体上的物理数据包,并带有传输级标头。接口不需要修改正在传输的数据。skb->data 指向正在传输的数据包,skb->len 是以八位字节为单位的长度。如果您的驱动程序可以处理分散/聚集 I/O,情况会变得更复杂一些;我们将在第 17.5.3 节中讨论这一点。

The socket buffer passed to hard_start_xmit contains the physical packet as it should appear on the media, complete with the transmission-level headers. The interface doesn't need to modify the data being transmitted. skb->data points to the packet being transmitted, and skb->len is its length in octets. This situation gets a little more complicated if your driver can handle scatter/gather I/O; we get to that in Section 17.5.3.

snull 的数据包传输代码如下;物理传输机制已被隔离在另一个函数中,因为每个接口驱动程序都必须根据所驱动的特定硬件来实现它:

The snull packet transmission code follows; the physical transmission machinery has been isolated in another function, because every interface driver must implement it according to the specific hardware being driven:

int snull_tx(struct sk_buff *skb, struct net_device *dev)
{
    int len;
    char *data, shortpkt[ETH_ZLEN];
    struct snull_priv *priv = netdev_priv(dev);
    
    data = skb->data;
    len = skb->len;
    if (len < ETH_ZLEN) {
        memset(shortpkt, 0, ETH_ZLEN);
        memcpy(shortpkt, skb->data, skb->len);
        len = ETH_ZLEN;
        data = shortpkt;
    }
    dev->trans_start = jiffies; /* save the timestamp */

    /* Remember the skb, so we can free it at interrupt time */
    priv->skb = skb;

    /* actual deliver of data is device-specific, and not shown here */
    snull_hw_tx(data, len, dev);

    return 0; /* Our simple device can not fail */
}

因此,传输函数仅对数据包执行一些健全性检查,并通过与硬件相关的函数传输数据。但请注意,当要传输的数据包短于底层媒体(对于 snull 来说,是我们的虚拟"以太网")支持的最小长度时,需要特别小心。许多 Linux 网络驱动程序(以及其他操作系统的网络驱动程序)被发现在这种情况下会泄漏数据。我们没有制造这种安全漏洞,而是将短数据包复制到一个单独的数组中,并显式地用零填充到媒体所需的完整长度。(我们可以安全地将这些数据放在堆栈上,因为最小长度(60 字节)非常小。)

The transmission function, thus, just performs some sanity checks on the packet and transmits the data through the hardware-related function. Do note, however, the care that is taken when the packet to be transmitted is shorter than the minimum length supported by the underlying media (which, for snull, is our virtual "Ethernet"). Many Linux network drivers (and those for other operating systems as well) have been found to leak data in such situations. Rather than create that sort of security vulnerability, we copy short packets into a separate array that we can explicitly zero-pad out to the full length required by the media. (We can safely put that data on the stack, since the minimum length—60 bytes—is quite small).
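The zero-padding step above can be illustrated with a small userspace sketch. The function `pad_short_packet` and its buffer-handling convention are invented for this example (they are not part of the snull driver); the point is simply that the scratch buffer is zeroed *before* the copy, so no stale stack bytes can leak onto the wire:

```c
#include <string.h>

#define ETH_ZLEN 60  /* minimum Ethernet frame length, as in <linux/if_ether.h> */

/* Pad a short packet with zeros into a caller-supplied buffer, mirroring
 * the logic in snull_tx.  Returns the length that should be handed to the
 * hardware.  "buf" must hold at least ETH_ZLEN bytes; for packets that are
 * already long enough, the caller keeps using the original data. */
static int pad_short_packet(const char *data, int len, char *buf)
{
    if (len >= ETH_ZLEN)
        return len;             /* long enough: transmit as-is */
    memset(buf, 0, ETH_ZLEN);   /* zero first, so no stack garbage leaks */
    memcpy(buf, data, len);
    return ETH_ZLEN;
}
```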

hard_start_xmit 的返回值在成功时应该是 0;此时,您的驱动程序已对数据包负责,应尽最大努力确保传输成功,并且必须在最后释放 skb。非零返回值表示此时无法传输数据包;内核将稍后重试。在这种情况下,您的驱动程序应该停止队列,直到导致故障的情况得到解决。

The return value from hard_start_xmit should be 0 on success; at that point, your driver has taken responsibility for the packet, should make its best effort to ensure that transmission succeeds, and must free the skb at the end. A nonzero return value indicates that the packet could not be transmitted at this time; the kernel will retry later. In this situation, your driver should stop the queue until whatever situation caused the failure has been resolved.

这里省略了"硬件相关"的传输函数(snull_hw_tx),因为它完全用于实现 snull 设备的欺骗,包括操纵源地址和目标地址,对真实网络驱动程序的作者没有多大兴趣。当然,它存在于示例源代码中,供那些想要深入了解其工作原理的人使用。

The "hardware-related" transmission function (snull_hw_tx) is omitted here since it is entirely occupied with implementing the trickery of the snull device, including manipulating the source and destination addresses, and has little of interest to authors of real network drivers. It is present, of course, in the sample source for those who want to go in and see how it works.

控制传输并发

Controlling Transmission Concurrency

hard_start_xmit 函数受到 net_device 结构中的自旋锁(xmit_lock)的保护,免受并发调用。然而,一旦函数返回,它就可能被再次调用。当软件完成向硬件下达数据包传输指令时,该函数返回,但硬件传输可能尚未完成。这对 snull 来说不是问题,它使用 CPU 完成所有工作,因此数据包传输在传输函数返回之前就已完成。

The hard_start_xmit function is protected from concurrent calls by a spinlock (xmit_lock) in the net_device structure. As soon as the function returns, however, it may be called again. The function returns when the software is done instructing the hardware about packet transmission, but hardware transmission will likely not have been completed. This is not an issue with snull, which does all of its work using the CPU, so packet transmission is complete before the transmission function returns.

另一方面,真正的硬件接口异步传输数据包,并且可用于存储传出数据包的内存量有限。当该内存耗尽时(对于某些硬件,这种情况发生在单个未完成的数据包要传输时),驱动程序需要告诉网络系统不要开始任何传输,直到硬件准备好接受新数据。

Real hardware interfaces, on the other hand, transmit packets asynchronously and have a limited amount of memory available to store outgoing packets. When that memory is exhausted (which, for some hardware, happens with a single outstanding packet to transmit), the driver needs to tell the networking system not to start any more transmissions until the hardware is ready to accept new data.

此通知是通过调用netif_stop_queue来完成的,该函数是之前介绍的用于停止队列的函数。一旦你的驱动程序停止了它的队列,它 必须安排在将来的某个时刻重新启动队列,当它再次能够接受数据包进行传输时。为此,它应该调用:

This notification is accomplished by calling netif_stop_queue, the function introduced earlier to stop the queue. Once your driver has stopped its queue, it must arrange to restart the queue at some point in the future, when it is again able to accept packets for transmission. To do so, it should call:

void netif_wake_queue(struct net_device *dev);

该函数与netif_start_queue类似,不同之处在于它还触发网络系统以使其再次开始传输数据包。

This function is just like netif_start_queue, except that it also pokes the networking system to make it start transmitting packets again.

大多数现代网络硬件都维护一个内部队列,其中包含多个要传输的数据包;这样它就可以从网络中获得最佳性能。这些设备的网络驱动程序必须支持在任何给定时间有多个未完成的传输,但无论硬件是否支持多个未完成的传输,设备内存都可能会被填满。每当设备内存填满到没有空间容纳最大可能的数据包时,驱动程序就应该停止队列,直到空间再次可用。

Most modern network hardware maintains an internal queue with multiple packets to transmit; in this way it can get the best performance from the network. Network drivers for these devices must support having multiple transmissions outstanding at any given time, but device memory can fill up whether or not the hardware supports multiple outstanding transmissions. Whenever device memory fills to the point that there is no room for the largest possible packet, the driver should stop the queue until space becomes available again.
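The stop/wake discipline just described can be modeled in plain C. Everything here — `struct fake_dev`, `TX_RING_SIZE`, `model_start_xmit`, and `model_tx_complete` — is a hypothetical stand-in for a device's descriptor ring and its driver; the comments mark where a real driver would call netif_stop_queue and netif_wake_queue:

```c
#include <stdbool.h>

#define TX_RING_SIZE 4  /* invented: slots in our imaginary tx ring */

struct fake_dev {
    int  tx_slots_used;   /* packets handed to hardware, not yet completed */
    bool queue_stopped;   /* mirrors the state set by netif_stop_queue */
};

/* Stands in for hard_start_xmit: returns 0 on success, -1 (busy) if the
 * ring is already full.  The driver stops the queue as soon as there is
 * no room for another packet. */
static int model_start_xmit(struct fake_dev *dev)
{
    if (dev->tx_slots_used >= TX_RING_SIZE)
        return -1;                     /* should not happen if we stop in time */
    dev->tx_slots_used++;
    if (dev->tx_slots_used == TX_RING_SIZE)
        dev->queue_stopped = true;     /* netif_stop_queue(dev) */
    return 0;
}

/* Stands in for the "transmission done" interrupt path: a slot frees up,
 * so a stopped queue can be restarted. */
static void model_tx_complete(struct fake_dev *dev)
{
    dev->tx_slots_used--;
    if (dev->queue_stopped)
        dev->queue_stopped = false;    /* netif_wake_queue(dev) */
}
```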

如果必须从 hard_start_xmit 函数以外的任何地方禁用数据包传输(也许是为了响应重新配置请求),您要使用的函数是:

If you must disable packet transmission from anywhere other than your hard_start_xmit function (in response to a reconfiguration request, perhaps), the function you want to use is:

void netif_tx_disable(struct net_device *dev);

此函数的行为与 netif_stop_queue 非常相似,但它还确保当它返回时,您的 hard_start_xmit 方法没有在另一个 CPU 上运行。像往常一样,可以使用 netif_wake_queue 重新启动队列。

This function behaves much like netif_stop_queue, but it also ensures that, when it returns, your hard_start_xmit method is not running on another CPU. The queue can be restarted with netif_wake_queue, as usual.

传输超时

Transmission Timeouts

大多数处理实际硬件的驱动程序都必须做好该硬件偶尔无法响应的准备。接口可能会忘记它们正在做什么,或者系统可能会丢失中断。此类问题在某些设计用于在个人计算机上运行的设备中很常见。

Most drivers that deal with real hardware have to be prepared for that hardware to fail to respond occasionally. Interfaces can forget what they are doing, or the system can lose an interrupt. This sort of problem is common with some devices designed to run on personal computers.

许多驱动程序通过设置定时器来处理这个问题;如果在定时器到期时操作尚未完成,则说明出现了问题。事实上,网络系统本质上是由大量定时器控制的状态机的复杂集合。因此,网络代码可以很好地在其常规操作中检测传输超时。

Many drivers handle this problem by setting timers; if the operation has not completed by the time the timer expires, something is wrong. The network system, as it happens, is essentially a complicated assembly of state machines controlled by a mass of timers. As such, the networking code is in a good position to detect transmission timeouts as part of its regular operation.

因此,网络驱动程序不必自己检测此类问题。相反,它们只需要设置一个超时周期,该周期位于 net_device 结构的 watchdog_timeo 字段中。这个周期(以 jiffies 为单位)应该足够长,以考虑正常的传输延迟(例如网络媒体拥塞引起的冲突)。

Thus, network drivers need not worry about detecting such problems themselves. Instead, they need only set a timeout period, which goes in the watchdog_timeo field of the net_device structure. This period, which is in jiffies, should be long enough to account for normal transmission delays (such as collisions caused by congestion on the network media).

如果当前系统时间超过设备的 trans_start 时间至少一个超时周期,网络层最终会调用驱动程序的 tx_timeout 方法。该方法的工作是采取一切必要措施来解决问题,并确保正确完成已在进行的任何传输。特别重要的是,驱动程序不要丢失对网络代码委托给它的任何套接字缓冲区的跟踪。

If the current system time exceeds the device's trans_start time by at least the timeout period, the networking layer eventually calls the driver's tx_timeout method. That method's job is to do whatever is needed to clear up the problem and to ensure the proper completion of any transmissions that were already in progress. It is important, in particular, that the driver not lose track of any socket buffers that have been entrusted to it by the networking code.

snull 能够模拟发射机锁定,这由两个加载时参数控制:

snull has the ability to simulate transmitter lockups, which is controlled by two load-time parameters:

static int lockup = 0;
module_param(lockup, int, 0);

static int timeout = SNULL_TIMEOUT;
module_param(timeout, int, 0);

如果驱动程序加载时带有参数 lockup=n,则每传输 n 个数据包就会模拟一次锁定,并且 watchdog_timeo 字段被设置为给定的 timeout 值。当模拟锁定时,snull 还会调用 netif_stop_queue 以防止发生其他传输尝试。

If the driver is loaded with the parameter lockup=n, a lockup is simulated once every n packets transmitted, and the watchdog_timeo field is set to the given timeout value. When simulating lockups, snull also calls netif_stop_queue to prevent other transmission attempts from occurring.

snull传输超时处理程序如下所示:

The snull transmission timeout handler looks like this:

void snull_tx_timeout (struct net_device *dev)
{
    struct snull_priv *priv = netdev_priv(dev);

    PDEBUG("Transmit timeout at %ld, latency %ld\n", jiffies,
            jiffies - dev->trans_start);
        /* Simulate a transmission interrupt to get things moving */
    priv->status = SNULL_TX_INTR;
    snull_interrupt(0, dev, NULL);
    priv->stats.tx_errors++;
    netif_wake_queue(dev);
    return;
}

当发生传输超时时,驱动程序必须在接口统计信息中标记错误,并安排将设备重置为正常状态,以便可以传输新数据包。当 snull 发生超时时,驱动程序调用 snull_interrupt 来填充"缺失"的中断,并使用 netif_wake_queue 重新启动传输队列。

When a transmission timeout happens, the driver must mark the error in the interface statistics and arrange for the device to be reset to a sane state so that new packets can be transmitted. When a timeout happens in snull, the driver calls snull_interrupt to fill in the "missing" interrupt and restarts the transmit queue with netif_wake_queue.

分散/聚集 I/O

Scatter/Gather I/O

创建要在网络上传输的数据包的过程涉及组装多个部分。数据包数据通常必须从用户空间复制进来,并且还必须添加网络堆栈各个级别使用的标头。这种组装可能需要大量的数据复制。然而,如果用于传输数据包的网络接口可以执行分散/聚集 I/O,则不需要将数据包组装成单个块,并且可以避免大部分复制。分散/聚集 I/O 还可以直接从用户空间缓冲区进行网络数据的"零复制"传输。

The process of creating a packet for transmission on the network involves assembling multiple pieces. Packet data must often be copied in from user space, and the headers used by various levels of the network stack must be added as well. This assembly can require a fair amount of data copying. If, however, the network interface that is destined to transmit the packet can perform scatter/gather I/O, the packet need not be assembled into a single chunk, and much of that copying can be avoided. Scatter/gather I/O also enables "zero-copy" transmission of network data directly from user-space buffers.

内核不会将分散的数据包传递给 hard_start_xmit 方法,除非 NETIF_F_SG 位已在设备结构的 features 字段中设置。如果您设置了该标志,则需要查看 skb 中的特殊"共享信息"字段,以查看数据包是由单个片段还是多个片段组成,并在需要时找到分散的片段。有一个特殊的宏用于访问此信息,称为 skb_shinfo。传输可能分段的数据包时,第一步通常如下所示:

The kernel does not pass scattered packets to your hard_start_xmit method unless the NETIF_F_SG bit has been set in the features field of your device structure. If you have set that flag, you need to look at a special "shared info" field within the skb to see whether the packet is made up of a single fragment or many and to find the scattered fragments if need be. A special macro exists to access this information; it is called skb_shinfo. The first step when transmitting potentially fragmented packets usually looks something like this:

if (skb_shinfo(skb)->nr_frags =  = 0) {
    /* Just use skb->data and skb->len as usual */
}

nr_frags 字段告诉我们使用了多少个片段来构建数据包。如果是 0,则数据包存在于单个整体中,可以像往常一样通过 data 字段访问。但是,如果它不为零,则您的驱动程序必须逐个安排传输每个片段。skb 结构的 data 字段方便地指向第一个片段(相对于未分段情况下的完整数据包)。该片段的长度必须通过从 skb->len(仍然包含完整数据包的长度)中减去 skb->data_len 来计算。剩余的片段可以在共享信息结构中名为 frags 的数组中找到;frags 中的每个条目都是一个 skb_frag_struct 结构:

The nr_frags field tells how many fragments have been used to build the packet. If it is 0, the packet exists in a single piece and can be accessed via the data field as usual. If, however, it is nonzero, your driver must pass through and arrange to transfer each individual fragment. The data field of the skb structure points conveniently to the first fragment (as compared to the full packet, as in the unfragmented case). The length of the fragment must be calculated by subtracting skb->data_len from skb->len (which still contains the length of the full packet). The remaining fragments are to be found in an array called frags in the shared information structure; each entry in frags is an skb_frag_struct structure:

struct skb_frag_struct {
    struct page *page;
    _ _u16 page_offset;
    _ _u16 size;
};

正如您所看到的,我们再次处理的是 page 结构,而不是内核虚拟地址。您的驱动程序应该循环遍历这些片段,将每个片段映射为 DMA 传输,并且不要忘记由 skb 直接指向的第一个片段。当然,您的硬件必须组装这些片段并将它们作为单个数据包传输。请注意,如果您设置了 NETIF_F_HIGHDMA 功能标志,则部分或全部片段可能位于高端内存中。

As you can see, we are once again dealing with page structures, rather than kernel virtual addresses. Your driver should loop through the fragments, mapping each for a DMA transfer and not forgetting the first fragment, which is pointed to by the skb directly. Your hardware, of course, must assemble the fragments and transmit them as a single packet. Note that, if you have set the NETIF_F_HIGHDMA feature flag, some or all of the fragments may be located in high memory.
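The length bookkeeping described above can be restated as a userspace sketch, using simplified stand-ins (`struct fake_skb`, `struct fake_frag` — invented names) for struct sk_buff and skb_frag_struct. Only the arithmetic is modeled; a real driver would DMA-map each piece rather than sum lengths:

```c
/* Simplified stand-ins for struct sk_buff and skb_frag_struct; only the
 * length fields relevant to fragment walking are kept. */
struct fake_frag { unsigned short size; };

struct fake_skb {
    unsigned int len;        /* length of the full packet          */
    unsigned int data_len;   /* bytes living in the fragment array */
    int nr_frags;
    struct fake_frag frags[8];
};

/* Length of the piece pointed to by skb->data: the headers plus any
 * unfragmented payload, i.e. skb->len - skb->data_len. */
static unsigned int first_piece_len(const struct fake_skb *skb)
{
    return skb->len - skb->data_len;
}

/* Walk every piece, as a driver's transmit loop would, and return the
 * total — which must come back out to skb->len. */
static unsigned int total_len(const struct fake_skb *skb)
{
    unsigned int total = first_piece_len(skb);
    for (int i = 0; i < skb->nr_frags; i++)
        total += skb->frags[i].size;
    return total;
}
```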

数据包接收

Packet Reception

从网络接收数据比传输数据更棘手,因为必须在原子上下文中分配sk_buff并将其传递给上层。网络驱动程序可以实现两种数据包接收模式:中断驱动和轮询。大多数驱动程序都实现中断驱动技术,这就是我们首先介绍的技术。一些高带宽适配器的驱动程序也可能实现轮询技术;我们在第 17.8 节中研究这种方法。

Receiving data from the network is trickier than transmitting it, because an sk_buff must be allocated and handed off to the upper layers from within an atomic context. There are two modes of packet reception that may be implemented by network drivers: interrupt driven and polled. Most drivers implement the interrupt-driven technique, and that is the one we cover first. Some drivers for high-bandwidth adapters may also implement the polled technique; we look at this approach in Section 17.8.

snull 的实现将"硬件"细节与设备无关的内务处理分开。因此,函数 snull_rx 是在硬件收到数据包之后从 snull 的"中断"处理程序中调用的,此时数据包已经在计算机内存中。snull_rx 接收指向数据的指针和数据包的长度;它的唯一职责是将数据包和一些附加信息发送到网络代码的上层。该代码与获取数据指针和长度的方式无关。

The implementation of snull separates the "hardware" details from the device-independent housekeeping. Therefore, the function snull_rx is called from the snull "interrupt" handler after the hardware has received the packet, and it is already in the computer's memory. snull_rx receives a pointer to the data and the length of the packet; its sole responsibility is to send the packet and some additional information to the upper layers of networking code. This code is independent of the way the data pointer and length are obtained.

void snull_rx(struct net_device *dev, struct snull_packet *pkt)
{
    struct sk_buff *skb;
    struct snull_priv *priv = netdev_priv(dev);

    /*
     * The packet has been retrieved from the transmission
     * medium. Build an skb around it, so upper layers can handle it
     */
    skb = dev_alloc_skb(pkt->datalen + 2);
    if (!skb) {
        if (printk_ratelimit(  ))
            printk(KERN_NOTICE "snull rx: low on mem - packet dropped\n");
        priv->stats.rx_dropped++;
        goto out;
    }
    memcpy(skb_put(skb, pkt->datalen), pkt->data, pkt->datalen);

    /* Write metadata, and then pass to the receive level */
    skb->dev = dev;
    skb->protocol = eth_type_trans(skb, dev);
    skb->ip_summed = CHECKSUM_UNNECESSARY; /* don't check it */
    priv->stats.rx_packets++;
    priv->stats.rx_bytes += pkt->datalen;
    netif_rx(skb);
  out:
    return;
}

该函数足够通用,可以充当任何网络驱动程序的模板,但在您可以放心地重用此代码片段之前,需要进行一些解释。

The function is sufficiently general to act as a template for any network driver, but some explanation is necessary before you can reuse this code fragment with confidence.

第一步是分配一个缓冲区来保存数据包。请注意,缓冲区分配函数(dev_alloc_skb)需要知道数据长度。该函数使用这一信息为缓冲区分配空间。dev_alloc_skb 以原子优先级调用 kmalloc,因此可以在中断时安全使用。内核还提供了其他套接字缓冲区分配接口,但这里不值得介绍;套接字缓冲区在第 17.10 节中详细解释。

The first step is to allocate a buffer to hold the packet. Note that the buffer allocation function (dev_alloc_skb) needs to know the data length. The information is used by the function to allocate space for the buffer. dev_alloc_skb calls kmalloc with atomic priority, so it can be used safely at interrupt time. The kernel offers other interfaces to socket-buffer allocation, but they are not worth introducing here; socket buffers are explained in detail in Section 17.10.

当然,必须检查 dev_alloc_skb 的返回值,snull 就是这样做的。然而,我们在抱怨失败之前先调用 printk_ratelimit。每秒生成数百或数千条控制台消息是使系统完全陷入困境并掩盖问题真正根源的好方法;当太多输出已经发送到控制台、需要放慢速度时,printk_ratelimit 返回 0,从而有助于防止该问题。

Of course, the return value from dev_alloc_skb must be checked, and snull does so. We call printk_ratelimit before complaining about failures, however. Generating hundreds or thousands of console messages per second is a good way to bog down the system entirely and hide the real source of problems; printk_ratelimit helps prevent that problem by returning 0 when too much output has gone to the console, and things need to be slowed down a bit.
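The suppression behavior described here can be modeled with a toy rate limiter: allow a burst of messages per interval, then return 0 until the clock advances. The structure and constants below are invented for illustration; the kernel's real `__ratelimit` logic is more elaborate, but the caller-side contract — emit only when the function returns nonzero — is the same:

```c
/* A toy model of printk_ratelimit-style suppression.  All names here
 * (toy_ratelimit, toy_ratelimit_ok) are invented for this sketch. */
struct toy_ratelimit {
    unsigned long interval;   /* ticks per interval            */
    int burst;                /* messages allowed per interval */
    unsigned long begin;      /* start of the current interval */
    int printed;              /* messages emitted so far       */
};

/* Returns 1 if the caller may print now, 0 if it should stay quiet. */
static int toy_ratelimit_ok(struct toy_ratelimit *rs, unsigned long now)
{
    if (now - rs->begin >= rs->interval) {   /* new interval: reset */
        rs->begin = now;
        rs->printed = 0;
    }
    if (rs->printed >= rs->burst)
        return 0;    /* too much output already: suppress */
    rs->printed++;
    return 1;
}
```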

一旦有了有效的 skb 指针,就可以通过调用 memcpy 将数据包数据复制到缓冲区中;skb_put 函数更新缓冲区中的数据结束指针,并返回指向新创建空间的指针。

Once there is a valid skb pointer, the packet data is copied into the buffer by calling memcpy; the skb_put function updates the end-of-data pointer in the buffer and returns a pointer to the newly created space.

如果您正在为可以执行完整总线主控 I/O 的接口编写高性能驱动程序,这里有一个值得考虑的可能优化。某些驱动程序在接收传入数据包之前为其分配套接字缓冲区,然后指示接口将数据包数据直接放入套接字缓冲区的空间。网络层通过在支持 DMA 的空间中分配所有套接字缓冲区来配合此策略(如果您的设备设置了 NETIF_F_HIGHDMA 功能标志,则该空间可能位于高端内存中)。这样做避免了用单独的复制操作来填充套接字缓冲区,但需要小心缓冲区大小,因为您无法提前知道传入数据包有多大。在这种情况下,实现 change_mtu 方法也很重要,因为它允许驱动程序响应最大数据包大小的更改。

If you are writing a high-performance driver for an interface that can do full bus-mastering I/O, there is a possible optimization that is worth considering here. Some drivers allocate socket buffers for incoming packets prior to their reception, then instruct the interface to place the packet data directly into the socket buffer's space. The networking layer cooperates with this strategy by allocating all socket buffers in DMA-capable space (which may be in high memory if your device has the NETIF_F_HIGHDMA feature flag set). Doing things this way avoids the need for a separate copy operation to fill the socket buffer, but requires being careful with buffer sizes because you won't know in advance how big the incoming packet is. The implementation of a change_mtu method is also important in this situation, since it allows the driver to respond to a change in the maximum packet size.

网络层需要先了解一些信息,然后才能理解数据包。为此,必须先为 dev 和 protocol 字段赋值,然后才能将缓冲区向上传递。以太网支持代码导出一个辅助函数(eth_type_trans),它找到要放入 protocol 的适当值。然后我们需要指定数据包的校验和应当如何执行或已经如何执行(snull 不需要执行任何校验和)。skb->ip_summed 的可能策略有:

The network layer needs to have some information spelled out before it can make sense of the packet. To this end, the dev and protocol fields must be assigned before the buffer is passed upstairs. The Ethernet support code exports a helper function (eth_type_trans), which finds an appropriate value to put into protocol. Then we need to specify how checksumming is to be performed or has been performed on the packet (snull does not need to perform any checksums). The possible policies for skb->ip_summed are:

CHECKSUM_HW
CHECKSUM_HW

该设备已经在硬件中执行了校验和。硬件校验和的一个示例是 SPARC HME 接口。

The device has already performed checksums in hardware. An example of a hardware checksum is the SPARC HME interface.

CHECKSUM_NONE
CHECKSUM_NONE

校验和尚未得到验证,该任务必须由系统软件来完成。这是新分配的缓冲区中的默认值。

Checksums have not yet been verified, and the task must be accomplished by system software. This is the default in newly allocated buffers.

CHECKSUM_UNNECESSARY
CHECKSUM_UNNECESSARY

不执行任何校验和。这是 snull 和环回接口中的策略。

Don't do any checksums. This is the policy in snull and in the loopback interface.

您可能想知道,既然我们已经在 net_device 结构的 features 字段中设置了标志,为什么还必须在此处指定校验和状态。答案是 features 标志告诉内核我们的设备如何处理传出数据包。它不用于传入数据包;传入数据包必须单独标记。

You may be wondering why the checksum status must be specified here when we have already set a flag in the features field of our net_device structure. The answer is that the features flag tells the kernel about how our device treats outgoing packets. It is not used for incoming packets, which must, instead, be marked individually.

最后,驱动程序更新其统计计数器以记录已接收到一个数据包。统计结构由几个字段组成;最重要的是 rx_packets、rx_bytes、tx_packets 和 tx_bytes,它们包含接收和发送的数据包数量以及传输的八位字节总数。第 17.13 节详细描述了所有字段。

Finally, the driver updates its statistics counter to record that a packet has been received. The statistics structure is made up of several fields; the most important are rx_packets, rx_bytes, tx_packets, and tx_bytes, which contain the number of packets received and transmitted and the total number of octets transferred. All the fields are thoroughly described in Section 17.13.

数据包接收的最后一步由 netif_rx 执行,它将套接字缓冲区交给上层。netif_rx 实际上返回一个整数值;NET_RX_SUCCESS(0)表示数据包已成功接收;任何其他值都表示有问题。有三个返回值(NET_RX_CN_LOW、NET_RX_CN_MOD 和 NET_RX_CN_HIGH)指示网络子系统中拥塞程度的递增;NET_RX_DROP 表示数据包被丢弃。当拥塞变高时,驱动程序可以使用这些值来停止向内核送入数据包,但实际上,大多数驱动程序都会忽略 netif_rx 的返回值。如果您正在为高带宽设备编写驱动程序,并希望针对拥塞采取正确的应对措施,那么最好的方法是实现 NAPI,我们在快速讨论中断处理程序后会谈到这一点。

The last step in packet reception is performed by netif_rx, which hands off the socket buffer to the upper layers. netif_rx actually returns an integer value; NET_RX_SUCCESS (0) means that the packet was successfully received; any other value indicates trouble. There are three return values (NET_RX_CN_LOW, NET_RX_CN_MOD, and NET_RX_CN_HIGH) that indicate increasing levels of congestion in the networking subsystem; NET_RX_DROP means the packet was dropped. A driver could use these values to stop feeding packets into the kernel when congestion gets high, but, in practice, most drivers ignore the return value from netif_rx. If you are writing a driver for a high-bandwidth device and wish to do the right thing in response to congestion, the best approach is to implement NAPI, which we get to after a quick discussion of interrupt handlers.

中断处理程序

The Interrupt Handler

大多数硬件接口都是通过中断处理程序来控制的。硬件通过中断处理器来发出两个可能事件之一的信号:新数据包已到达,或传出数据包的传输已完成。网络接口还可以生成中断来指示错误、链路状态更改等。

Most hardware interfaces are controlled by means of an interrupt handler. The hardware interrupts the processor to signal one of two possible events: a new packet has arrived or transmission of an outgoing packet is complete. Network interfaces can also generate interrupts to signal errors, link status changes, and so on.

通常的中断例程可以通过检查物理设备上的状态寄存器来区分新数据包到达中断和传输完成通知。snull 接口的工作原理类似,但它的状态字是在软件中实现的,存在于 dev->priv 中。网络接口的中断处理程序如下所示:

The usual interrupt routine can tell the difference between a new-packet-arrived interrupt and a done-transmitting notification by checking a status register found on the physical device. The snull interface works similarly, but its status word is implemented in software and lives in dev->priv. The interrupt handler for a network interface looks like this:

static void snull_regular_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    int statusword;
    struct snull_priv *priv;
    struct snull_packet *pkt = NULL;
    /*
     * As usual, check the "device" pointer to be sure it is
     * really interrupting.
     * Then assign "struct device *dev"
     */
    struct net_device *dev = (struct net_device *)dev_id;
    /* ... and check with hw if it's really ours */

    /* paranoid */
    if (!dev)
        return;

    /* Lock the device */
    priv = netdev_priv(dev);
    spin_lock(&priv->lock);

    /* retrieve statusword: real netdevices use I/O instructions */
    statusword = priv->status;
    priv->status = 0;
    if (statusword & SNULL_RX_INTR) {
        /* send it to snull_rx for handling */
        pkt = priv->rx_queue;
        if (pkt) {
            priv->rx_queue = pkt->next;
            snull_rx(dev, pkt);
        }
    }
    if (statusword & SNULL_TX_INTR) {
        /* a transmission is over: free the skb */
        priv->stats.tx_packets++;
        priv->stats.tx_bytes += priv->tx_packetlen;
        dev_kfree_skb(priv->skb);
    }

    /* Unlock the device and we are done */
    spin_unlock(&priv->lock);
    if (pkt) snull_release_buffer(pkt); /* Do this outside the lock! */
    return;
}

处理程序的第一个任务是检索指向正确的 struct net_device 的指针。该指针通常来自作为参数接收的 dev_id 指针。

The handler's first task is to retrieve a pointer to the correct struct net_device. This pointer usually comes from the dev_id pointer received as an argument.

该处理程序有趣的部分处理"传输完成"的情况。在这种情况下,统计信息会更新,并调用 dev_kfree_skb 将(不再需要的)套接字缓冲区返回给系统。实际上,可以调用该函数的三个变体:

The interesting part of this handler deals with the "transmission done" situation. In this case, the statistics are updated, and dev_kfree_skb is called to return the (no longer needed) socket buffer to the system. There are, actually, three variants of this function that may be called:

dev_kfree_skb(struct sk_buff *skb);
dev_kfree_skb(struct sk_buff *skb);

当您知道您的代码不会在中断上下文中运行时,应该调用此版本。由于snull没有实际的硬件中断,因此这是我们使用的版本。

This version should be called when you know that your code will not be running in interrupt context. Since snull has no actual hardware interrupts, this is the version we use.

dev_kfree_skb_irq(struct sk_buff *skb);
dev_kfree_skb_irq(struct sk_buff *skb);

如果您知道将在中断处理程序中释放缓冲区,请使用此版本,它针对这种情况进行了优化。

If you know that you will be freeing the buffer in an interrupt handler, use this version, which is optimized for that case.

dev_kfree_skb_any(struct sk_buff *skb);
dev_kfree_skb_any(struct sk_buff *skb);

如果相关代码可以在中断或非中断上下文中运行,则可以使用此版本。

This is the version to use if the relevant code could be running in either interrupt or noninterrupt context.

最后,如果您的驱动程序暂时停止了传输队列,通常可以在此处使用netif_wake_queue重新启动它。

Finally, if your driver has temporarily stopped the transmission queue, this is usually the place to restart it with netif_wake_queue.

与传输相比,数据包接收不需要任何特殊的中断处理。调用snull_rx(我们已经见过)就是所需要的。

Packet reception, in contrast to transmission, doesn't need any special interrupt handling. Calling snull_rx (which we have already seen) is all that's required.

接收中断缓解

Receive Interrupt Mitigation

当网络驱动程序按上述方式编写时,处理器会因接口收到的每个数据包而中断。在许多情况下,这是所需的操作模式,并不成问题。然而,高带宽接口每秒可以接收数千个数据包。在这种中断负载下,系统的整体性能可能会受到影响。

When a network driver is written as we have described above, the processor is interrupted for every packet received by your interface. In many cases, that is the desired mode of operation, and it is not a problem. High-bandwidth interfaces, however, can receive thousands of packets per second. With that sort of interrupt load, the overall performance of the system can suffer.

作为提高 Linux 在高端系统上性能的一种方法,网络子系统开发人员创建了一种基于轮询的替代接口(称为 NAPI)[1]。对于驱动程序开发人员来说,“轮询”可能是一个忌讳的词,他们经常认为轮询技术不优雅且效率低下。然而,只有在没有工作可做时仍去轮询接口,轮询才是低效的。当系统具有处理大量流量的高速接口时,总是有更多的数据包需要处理。在这种情况下不需要中断处理器;每隔一段时间从接口收集新数据包就足够了。

As a way of improving the performance of Linux on high-end systems, the networking subsystem developers have created an alternative interface (called NAPI)[1] based on polling. "Polling" can be a dirty word among driver developers, who often see polling techniques as inelegant and inefficient. Polling is inefficient, however, only if the interface is polled when there is no work to do. When the system has a high-speed interface handling heavy traffic, there are always more packets to process. There is no need to interrupt the processor in such situations; it is enough that the new packets be collected from the interface every so often.

停止接收中断可以减轻处理器的大量负载。如果这些数据包由于拥塞而在网络代码中被丢弃,那么符合 NAPI 的驱动程序还可以被告知不要将数据包送入内核,这也可以在最需要帮助时提高性能。由于各种原因,NAPI 驱动程序也不太可能对数据包重新排序。

Stopping receive interrupts can take a substantial amount of load off the processor. NAPI-compliant drivers can also be told not to feed packets into the kernel if those packets are just dropped in the networking code due to congestion, which can also help performance when that help is needed most. For various reasons, NAPI drivers are also less likely to reorder packets.

然而,并非所有设备都可以在 NAPI 模式下运行。支持 NAPI 的接口必须能够存储多个数据包(在卡本身上,或在内存 DMA 环中)。该接口应该能够禁用接收数据包的中断,同时继续对成功传输和其他事件进行中断。还有其他一些微妙的问题可能会使编写符合 NAPI 的驱动程序变得更加困难;有关详细信息,请参阅内核源代码树中的Documentation/networking/NAPI_HOWTO.txt 。

Not all devices can operate in the NAPI mode, however. A NAPI-capable interface must be able to store several packets (either on the card itself, or in an in-memory DMA ring). The interface should be capable of disabling interrupts for received packets, while continuing to interrupt for successful transmissions and other events. There are other subtle issues that can make writing a NAPI-compliant driver harder; see Documentation/networking/NAPI_HOWTO.txt in the kernel source tree for the details.

实现 NAPI 接口的驱动程序相对较少。然而,如果您正在为可能生成大量中断的接口编写驱动程序,那么花时间实现 NAPI 可能是值得的。

Relatively few drivers implement the NAPI interface. If you are writing a driver for an interface that may generate a huge number of interrupts, however, taking the time to implement NAPI may well prove worthwhile.

当加载时将 use_napi 参数设置为非零值,snull 驱动程序将在 NAPI 模式下运行。在初始化时,我们必须设置几个额外的 struct net_device 字段:

The snull driver, when loaded with the use_napi parameter set to a nonzero value, operates in the NAPI mode. At initialization time, we have to set up a couple of extra struct net_device fields:

if (use_napi) {
    dev->poll        = snull_poll;
    dev->weight      = 2;
}

poll 字段必须设置为您的驱动程序的轮询函数;我们稍后会看到 snull_poll。weight 字段描述了接口的相对重要性:当资源紧张时,应该从该接口接受多少流量。对于如何设置 weight 参数没有严格的规定;按照惯例,10 MBps 以太网接口将 weight 设置为 16,而更快的接口则使用 64。您不应将 weight 设置为大于接口所能存储的数据包数量的值。在 snull 中,我们将 weight 设置为 2,以演示延迟数据包接收的方式。

The poll field must be set to your driver's polling function; we look at snull_poll shortly. The weight field describes the relative importance of the interface: how much traffic should be accepted from the interface when resources are tight. There are no strict rules for how the weight parameter should be set; by convention, 10 MBps Ethernet interfaces set weight to 16, while faster interfaces use 64. You should not set weight to a value greater than the number of packets your interface can store. In snull, we set the weight to two as a way of demonstrating deferred packet reception.

创建符合 NAPI 的驱动程序的下一步是更改中断处理程序。当您的接口(应以启用接收中断的方式启动)发出数据包已到达的信号时,中断处理程序不应处理该数据包。相反,它应该禁用进一步的接收中断并告诉内核是时候开始轮询接口了。在snull “中断”处理程序中,响应数据包接收中断的代码已更改为以下内容:

The next step in the creation of a NAPI-compliant driver is to change the interrupt handler. When your interface (which should start with receive interrupts enabled) signals that a packet has arrived, the interrupt handler should not process that packet. Instead, it should disable further receive interrupts and tell the kernel that it is time to start polling the interface. In the snull "interrupt" handler, the code that responds to packet reception interrupts has been changed to the following:

if (statusword & SNULL_RX_INTR) {
    snull_rx_ints(dev, 0);  /* Disable further interrupts */
    netif_rx_schedule(dev);
}

当接口告诉我们有一个数据包可用时,中断处理程序会将其留在接口中;此时需要发生的就是调用 netif_rx_schedule,这会导致我们的poll方法在将来的某个时刻被调用。

When the interface tells us that a packet is available, the interrupt handler leaves it in the interface; all that needs to happen at this point is a call to netif_rx_schedule, which causes our poll method to be called at some future point.

poll方法原型如下:

The poll method has this prototype:

int (*poll)(struct net_device *dev, int *budget);
int (*poll)(struct net_device *dev, int *budget);

poll方法的 snull 实现如下所示

The snull implementation of the poll method looks like this:

static int snull_poll(struct net_device *dev, int *budget)
{
    int npackets = 0, quota = min(dev->quota, *budget);
    struct sk_buff *skb;
    struct snull_priv *priv = netdev_priv(dev);
    struct snull_packet *pkt;
    
    while (npackets < quota && priv->rx_queue) {
        pkt = snull_dequeue_buf(dev);
        skb = dev_alloc_skb(pkt->datalen + 2);
        if (!skb) {
            if (printk_ratelimit())
                printk(KERN_NOTICE "snull: packet dropped\n");
            priv->stats.rx_dropped++;
            snull_release_buffer(pkt);
            continue;
        }
        memcpy(skb_put(skb, pkt->datalen), pkt->data, pkt->datalen);
        skb->dev = dev;
        skb->protocol = eth_type_trans(skb, dev);
        skb->ip_summed = CHECKSUM_UNNECESSARY; /* don't check it */
        netif_receive_skb(skb);
        
            /* Maintain stats */
        npackets++;
        priv->stats.rx_packets++;
        priv->stats.rx_bytes += pkt->datalen;
        snull_release_buffer(pkt);
    }
    /* If we processed all packets, we're done; tell the kernel and reenable ints */
    *budget -= npackets;
    dev->quota -= npackets;
    if (!priv->rx_queue) {
        netif_rx_complete(dev);
        snull_rx_ints(dev, 1);
        return 0;
    }
    /* We couldn't process everything. */
    return 1;
}

该函数的核心部分涉及创建保存数据包的 skb;这段代码与我们之前在snull_rx中看到的相同。然而,有很多事情是不同的:

The central part of the function is concerned with the creation of an skb holding the packet; this code is the same as what we saw in snull_rx before. A number of things are different, however:

  • budget 参数提供了允许我们传递到内核的最大数据包数量。在设备结构中,quota 字段给出了另一个最大值;poll 方法必须遵守这两个限制中较低的一个。它还应该将 dev->quota 和 *budget 减去实际接收的数据包数量。budget 值是当前 CPU 可以从所有接口接收的最大数据包数,而 quota 是每个接口的值,通常初始为在初始化时分配给接口的 weight。

  • The budget parameter provides a maximum number of packets that we are allowed to pass into the kernel. Within the device structure, the quota field gives another maximum; the poll method must respect the lower of the two limits. It should also decrement both dev->quota and *budget by the number of packets actually received. The budget value is a maximum number of packets that the current CPU can receive from all interfaces, while quota is a per-interface value that usually starts out as the weight assigned to the interface at initialization time.

  • 应使用netif_receive_skb而不是netif_rx将数据包馈送到内核。

  • Packets should be fed to the kernel with netif_receive_skb, rather than netif_rx.

  • 如果poll方法能够在给定的限制内处理所有可用的数据包,它应该重新启用接收中断,调用 netif_rx_complete关闭轮询,并返回0。返回值1 表示还有数据包需要处理。

  • If the poll method is able to process all of the available packets within the limits given to it, it should re-enable receive interrupts, call netif_rx_complete to turn off polling, and return 0. A return value of 1 indicates that there are packets remaining to be processed.

网络子系统保证任何给定设备的轮询 方法不会在多个处理器上同时调用。但是,对poll的调用 仍然可以与对其他设备方法的调用同时发生。

The networking subsystem guarantees that any given device's poll method will not be called concurrently on more than one processor. Calls to poll can still happen concurrently with calls to your other device methods, however.
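The accounting rules listed above can be sketched in plain user-space C. This is a simplified model with invented names (fake_dev, fake_poll) rather than kernel code; it only illustrates the arithmetic a poll method must perform: honor min(quota, *budget), decrement both counters, and return 0 only when the queue is drained.

```c
#include <assert.h>

/* Simplified user-space model of the 2.6 poll-method accounting.
 * fake_dev and fake_poll are invented names, not kernel APIs. */
struct fake_dev {
    int quota;       /* per-interface limit, refilled from weight */
    int rx_pending;  /* packets waiting in the "hardware" queue */
};

static int fake_poll(struct fake_dev *dev, int *budget)
{
    int limit = dev->quota < *budget ? dev->quota : *budget;
    int npackets = 0;

    while (npackets < limit && dev->rx_pending > 0) {
        dev->rx_pending--;        /* stand-in for netif_receive_skb */
        npackets++;
    }
    *budget -= npackets;          /* decrement both counters ... */
    dev->quota -= npackets;       /* ... by packets actually handled */
    return dev->rx_pending ? 1 : 0;   /* 1 means: keep polling */
}
```

With a weight (quota) of 2 and five packets queued, three calls are needed before the function reports completion, mirroring how snull demonstrates deferred reception.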

链接状态的变化

Changes in Link State

根据定义,网络连接涉及本地系统之外的世界。因此,它们常常受到外界事件的影响,而且可能是短暂的事情。网络子系统需要知道网络链路何时启动或关闭,并且它提供了一些驱动程序可以用来传达该信息的函数。

Network connections, by definition, deal with the world outside the local system. Therefore, they are often affected by outside events, and they can be transient things. The networking subsystem needs to know when network links go up or down, and it provides a few functions that the driver may use to convey that information.

大多数涉及实际物理连接的网络技术都提供载波状态;载波的存在意味着硬件已经就位并准备好运行。例如,以太网适配器会感测线路上的载波信号;当用户被电缆绊倒时,载波就会消失,链路随之中断。默认情况下,假定网络设备存在载波信号。但是,驱动程序可以使用以下函数显式更改该状态:

Most networking technologies involving an actual, physical connection provide a carrier state; the presence of the carrier means that the hardware is present and ready to function. Ethernet adapters, for example, sense the carrier signal on the wire; when a user trips over the cable, that carrier vanishes, and the link goes down. By default, network devices are assumed to have a carrier signal present. The driver can change that state explicitly, however, with these functions:

void netif_carrier_off(struct net_device *dev);
void netif_carrier_on(struct net_device *dev);

如果您的驱动程序检测到其某个设备缺少载波,则应调用 netif_carrier_off 来通知内核此更改。当载波恢复时,应调用 netif_carrier_on。一些驱动程序在进行重大配置更改(例如媒体类型)时也会调用 netif_carrier_off;一旦适配器完成自身复位,就会检测到新的载波,流量即可恢复。

If your driver detects a lack of carrier on one of its devices, it should call netif_carrier_off to inform the kernel of this change. When the carrier returns, netif_carrier_on should be called. Some drivers also call netif_carrier_off when making major configuration changes (such as media type); once the adapter has finished resetting itself, the new carrier is detected and traffic can resume.

还存在整数函数:

An integer function also exists:

int netif_carrier_ok(struct net_device *dev);
int netif_carrier_ok(struct net_device *dev);

这可用于测试当前的载波状态(如设备结构中所反映的)。

This can be used to test the current carrier state (as reflected in the device structure).

套接字缓冲区

The Socket Buffers

我们现在已经涵盖了与网络接口有关的大多数问题。仍然缺少的是对 sk_buff 结构更详细的讨论。该结构体是 Linux 内核网络子系统的核心,下面我们介绍该结构体的主要字段以及作用于它的函数。

We've now covered most of the issues related to network interfaces. What's still missing is some more detailed discussion of the sk_buff structure. The structure is at the core of the network subsystem of the Linux kernel, and we now introduce both the main fields of the structure and the functions used to act on it.

尽管并不严格要求了解 sk_buff 的内部结构,但当您追踪问题或尝试优化代码时,查看其内容的能力会很有帮助。例如,如果您查看 loopback.c,您会发现一个基于 sk_buff 内部知识的优化。通常的警告同样适用于此:如果您编写的代码利用了 sk_buff 结构的内部知识,您应该准备好看到它在未来的内核版本中失效。尽管如此,有时性能优势证明额外的维护成本是值得的。

Although there is no strict need to understand the internals of sk_buff, the ability to look at its contents can be helpful when you are tracking down problems and when you are trying to optimize your code. For example, if you look in loopback.c, you'll find an optimization based on knowledge of the sk_buff internals. The usual warning applies here: if you write code that takes advantage of knowledge of the sk_buff structure, you should be prepared to see it break with future kernel releases. Still, sometimes the performance advantages justify the additional maintenance cost.

我们不打算在这里描述整个结构,只描述驱动程序中可能使用的字段。如果您想了解更多信息,可以查看 <linux/skbuff.h>,其中定义了结构并原型化了函数。有关如何使用字段和函数的其他详细信息可以通过在内核源代码中进行 grep 轻松检索。

We are not going to describe the whole structure here, just the fields that might be used from within a driver. If you want to see more, you can look at <linux/skbuff.h>, where the structure is defined and the functions are prototyped. Additional details about how the fields and functions are used can be easily retrieved by grepping in the kernel sources.

重要字段

The Important Fields

这里介绍的字段是驱动程序可能需要访问的字段。它们没有特定的顺序列出。

The fields introduced here are the ones a driver might need to access. They are listed in no particular order.

struct net_device *dev;
struct net_device *dev;

接收或发送此缓冲区的设备。

The device receiving or sending this buffer.

union { /* ... */ } h;

union { /* ... */ } nh;

union { /* ... */ } mac;
union { /* ... */ } h;

union { /* ... */ } nh;

union { /* ... */ } mac;

指向数据包中包含的各个层级标头的指针。联合的每个字段都是指向不同类型数据结构的指针。h 保存指向传输层标头的指针(例如 struct tcphdr *th);nh 包括网络层标头(例如 struct iphdr *iph);mac 则收集指向链路层标头的指针(例如 struct ethhdr *ethernet)。

Pointers to the various levels of headers contained within the packet. Each field of the union is a pointer to a different type of data structure. h hosts pointers to transport layer headers (for example, struct tcphdr *th); nh includes network layer headers (such as struct iphdr *iph); and mac collects pointers to link-layer headers (such as struct ethhdr *ethernet).

如果您的驱动程序需要查看 TCP 数据包的源地址和目标地址,它可以在 skb->h.th 中找到它们。有关可以通过这种方式访问的完整标头类型集,请参阅头文件。

If your driver needs to look at the source and destination addresses of a TCP packet, it can find them in skb->h.th. See the header file for the full set of header types that can be accessed in this way.

请注意,网络驱动程序负责为传入数据包设置 mac 指针。该任务通常由 eth_type_trans 处理,但非以太网驱动程序必须直接设置 skb->mac.raw,如第 17.11.3 节所示。

Note that network drivers are responsible for setting the mac pointer for incoming packets. This task is normally handled by eth_type_trans, but non-Ethernet drivers have to set skb->mac.raw directly, as shown in Section 17.11.3.

unsigned char *head;

unsigned char *data;

unsigned char *tail;

unsigned char *end;
unsigned char *head;

unsigned char *data;

unsigned char *tail;

unsigned char *end;

用于寻址数据包中数据的指针。head 指向分配空间的开头,data 是有效八位字节的开头(通常略大于 head),tail 是有效八位字节的结尾,而 end 指向 tail 可以到达的最大地址。另一种看待它的方式是:可用缓冲区空间为 skb->end - skb->head,当前使用的数据空间为 skb->tail - skb->data。

Pointers used to address the data in the packet. head points to the beginning of the allocated space, data is the beginning of the valid octets (and is usually slightly greater than head), tail is the end of the valid octets, and end points to the maximum address tail can reach. Another way to look at it is that the available buffer space is skb->end - skb->head, and the currently used data space is skb->tail - skb->data.
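The pointer arithmetic described above can be modeled in a few lines of user-space C. The toy_* names below are invented for illustration; this is not the kernel structure, only the relationships among the four pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the four sk_buff data pointers: total buffer space
 * is end - head, and the data currently in use is tail - data. */
struct toy_skb {
    unsigned char *head;   /* start of the allocated space */
    unsigned char *data;   /* first valid octet */
    unsigned char *tail;   /* one past the last valid octet */
    unsigned char *end;    /* maximum address tail can reach */
};

static size_t toy_buffer_space(const struct toy_skb *skb)
{
    return (size_t)(skb->end - skb->head);
}

static size_t toy_used_space(const struct toy_skb *skb)
{
    return (size_t)(skb->tail - skb->data);
}
```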

unsigned int len;

unsigned int data_len;
unsigned int len;

unsigned int data_len;

len 是数据包中数据的完整长度,而 data_len 是存储在单独片段中的那部分数据包的长度。除非使用分散/聚集 I/O,否则 data_len 字段为 0。

len is the full length of the data in the packet, while data_len is the length of the portion of the packet stored in separate fragments. The data_len field is 0 unless scatter/gather I/O is being used.

unsigned char ip_summed;
unsigned char ip_summed;

该数据包的校验和策略。该字段由驱动程序在传入数据包上设置,如第 17.6 节中所述。

The checksum policy for this packet. The field is set by the driver on incoming packets, as described in the Section 17.6.

unsigned char pkt_type;
unsigned char pkt_type;

数据包在传送过程中使用的分类。驱动程序负责将其设置为 PACKET_HOST(这个数据包是给我的)、PACKET_OTHERHOST(不,这个数据包不是给我的)、PACKET_BROADCAST 或 PACKET_MULTICAST。以太网驱动程序不会显式修改 pkt_type,因为 eth_type_trans 会为它们代劳。

Packet classification used in its delivery. The driver is responsible for setting it to PACKET_HOST (this packet is for me), PACKET_OTHERHOST (no, this packet is not for me), PACKET_BROADCAST, or PACKET_MULTICAST. Ethernet drivers don't modify pkt_type explicitly because eth_type_trans does it for them.

struct skb_shared_info *skb_shinfo(struct sk_buff *skb);

unsigned int skb_shinfo(skb)->nr_frags;

skb_frag_t skb_shinfo(skb)->frags;
struct skb_shared_info *skb_shinfo(struct sk_buff *skb);

unsigned int skb_shinfo(skb)->nr_frags;

skb_frag_t skb_shinfo(skb)->frags;

出于性能原因,一些 skb 信息存储在一个单独的结构中,该结构在内存中紧随 skb 之后。这个“共享信息”(之所以这样称呼,是因为它可以在网络代码中的 skb 副本之间共享)必须通过 skb_shinfo 宏进行访问。这个结构中有几个字段,但其中大部分超出了本书的范围。我们在第 17.5.3 节中看到了 nr_frags 和 frags。

For performance reasons, some skb information is stored in a separate structure that appears immediately after the skb in memory. This "shared info" (so called because it can be shared among copies of the skb within the networking code) must be accessed via the skb_shinfo macro. There are several fields in this structure, but most of them are beyond the scope of this book. We saw nr_frags and frags in Section 17.5.3.

结构中的其余字段并不特别有趣。它们用于维护缓冲区列表、核算属于拥有该缓冲区的套接字的内存,等等。

The remaining fields in the structure are not particularly interesting. They are used to maintain lists of buffers, to account for memory belonging to the socket that owns the buffer, and so on.

作用于套接字缓冲区的函数

Functions Acting on Socket Buffers

网络设备使用sk_buff 结构体通过官方接口函数对其进行操作。许多函数在套接字缓冲区上运行;以下是最有趣的:

Network devices that use an sk_buff structure act on it by means of the official interface functions. Many functions operate on socket buffers; here are the most interesting ones:

struct sk_buff *alloc_skb(unsigned int len, int priority);

struct sk_buff *dev_alloc_skb(unsigned int len);
struct sk_buff *alloc_skb(unsigned int len, int priority);

struct sk_buff *dev_alloc_skb(unsigned int len);

分配一个缓冲区。alloc_skb 函数分配一个缓冲区,并将 skb->data 和 skb->tail 都初始化为 skb->head。dev_alloc_skb 函数是以 GFP_ATOMIC 优先级调用 alloc_skb 的快捷方式,并在 skb->head 和 skb->data 之间保留一些空间。该数据空间用于网络层内的优化,驱动程序不应触及。

Allocate a buffer. The alloc_skb function allocates a buffer and initializes both skb->data and skb->tail to skb->head. The dev_alloc_skb function is a shortcut that calls alloc_skb with GFP_ATOMIC priority and reserves some space between skb->head and skb->data. This data space is used for optimizations within the network layer and should not be touched by the driver.

void kfree_skb(struct sk_buff *skb);

void dev_kfree_skb(struct sk_buff *skb);

void dev_kfree_skb_irq(struct sk_buff *skb);

void dev_kfree_skb_any(struct sk_buff *skb);
void kfree_skb(struct sk_buff *skb);

void dev_kfree_skb(struct sk_buff *skb);

void dev_kfree_skb_irq(struct sk_buff *skb);

void dev_kfree_skb_any(struct sk_buff *skb);

释放一个缓冲区。kfree_skb 调用由内核内部使用;驱动程序应使用 dev_kfree_skb 的某种形式:dev_kfree_skb 用于非中断上下文,dev_kfree_skb_irq 用于中断上下文,dev_kfree_skb_any 用于可以在任一上下文中运行的代码。

Free a buffer. The kfree_skb call is used internally by the kernel. A driver should use one of the forms of dev_kfree_skb instead: dev_kfree_skb for noninterrupt context, dev_kfree_skb_irq for interrupt context, or dev_kfree_skb_any for code that can run in either context.

unsigned char *skb_put(struct sk_buff *skb, int len);

unsigned char *__skb_put(struct sk_buff *skb, int len);
unsigned char *skb_put(struct sk_buff *skb, int len);

unsigned char *__skb_put(struct sk_buff *skb, int len);

更新 sk_buff 结构体的 tail 和 len 字段;它们用于将数据添加到缓冲区的末尾。每个函数的返回值都是之前的 skb->tail 值(换句话说,它指向刚刚创建的数据空间)。驱动程序可以使用返回值,通过调用 memcpy(skb_put(...), data, len) 或等效函数来复制数据。这两个函数的区别在于 skb_put 会检查以确保数据适合缓冲区,而 __skb_put 则忽略该检查。

Update the tail and len fields of the sk_buff structure; they are used to add data to the end of the buffer. Each function's return value is the previous value of skb->tail (in other words, it points to the data space just created). Drivers can use the return value to copy data by invoking memcpy(skb_put(...), data, len) or an equivalent. The difference between the two functions is that skb_put checks to be sure that the data fits in the buffer, whereas __skb_put omits the check.

unsigned char *skb_push(struct sk_buff *skb, int len);

unsigned char *__skb_push(struct sk_buff *skb, int len);
unsigned char *skb_push(struct sk_buff *skb, int len);

unsigned char *__skb_push(struct sk_buff *skb, int len);

递减 skb->data 并递增 skb->len 的函数。它们与 skb_put 类似,只不过数据被添加到数据包的开头而不是结尾。返回值指向刚刚创建的数据空间。这些函数用于在传输数据包之前添加硬件标头。再次强调,__skb_push 的不同之处在于它不检查是否有足够的可用空间。

Functions to decrement skb->data and increment skb->len. They are similar to skb_put, except that data is added to the beginning of the packet instead of the end. The return value points to the data space just created. The functions are used to add a hardware header before transmitting a packet. Once again, __skb_push differs in that it does not check for adequate available space.
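The put/push pointer arithmetic can be illustrated with a user-space sketch. The demo_* names are invented; real drivers must use the kernel functions, which also validate the available space.

```c
#include <assert.h>

/* Toy model of skb_put/skb_push: put grows the valid data at the
 * tail, push grows it at the front; both return a pointer to the
 * space just created.  No bounds checking, like the __ variants. */
struct demo_skb {
    unsigned char buf[128];
    unsigned char *data;   /* first valid octet */
    unsigned char *tail;   /* end of valid octets */
    unsigned int len;      /* valid data length */
};

static unsigned char *demo_skb_put(struct demo_skb *skb, unsigned int n)
{
    unsigned char *old_tail = skb->tail;  /* returned: new space at end */
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

static unsigned char *demo_skb_push(struct demo_skb *skb, unsigned int n)
{
    skb->data -= n;        /* returned: new space at the front */
    skb->len += n;
    return skb->data;
}
```

A driver would memcpy the payload to the pointer returned by put, then push a 14-byte hardware header in front.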

int skb_tailroom(struct sk_buff *skb);
int skb_tailroom(struct sk_buff *skb);

返回可用于将数据放入缓冲区的空间量。如果驱动程序向缓冲区放入的数据多于缓冲区所能容纳的数据,系统就会 panic。尽管您可能会反对说一条 printk 就足以标记该错误,但内存损坏对系统危害极大,因此开发人员决定采取果断措施。实际上,如果缓冲区已正确分配,则不需要检查可用空间。由于驱动程序通常在分配缓冲区之前就获得了数据包大小,因此只有严重损坏的驱动程序才会在缓冲区中放入过多的数据,panic 也可以被视为应得的惩罚。

Returns the amount of space available for putting data in the buffer. If a driver puts more data into the buffer than it can hold, the system panics. Although you might object that a printk would be sufficient to tag the error, memory corruption is so harmful to the system that the developers decided to take definitive action. In practice, you shouldn't need to check the available space if the buffer has been correctly allocated. Since drivers usually get the packet size before allocating a buffer, only a severely broken driver puts too much data in the buffer, and a panic might be seen as due punishment.

int skb_headroom(struct sk_buff *skb);
int skb_headroom(struct sk_buff *skb);

返回 前面的可用空间量data,即可以“推送”到缓冲区的八位字节数。

Returns the amount of space available in front of data, that is, how many octets one can "push" to the buffer.

void skb_reserve(struct sk_buff *skb, int len);
void skb_reserve(struct sk_buff *skb, int len);

同时递增 data 和 tail。该函数可用于在填充缓冲区之前预留头部空间。大多数以太网接口在数据包前面保留两个字节;这样,IP 标头在 14 字节以太网标头之后,对齐在 16 字节边界上。snull 也这样做,尽管第 17.6 节中没有显示该指令,以避免在那时引入额外的概念。

Increments both data and tail. The function can be used to reserve headroom before filling the buffer. Most Ethernet interfaces reserve two bytes in front of the packet; thus, the IP header is aligned on a 16-byte boundary, after a 14-byte Ethernet header. snull does this as well, although the instruction was not shown in Section 17.6 to avoid introducing extra concepts at that point.
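The alignment effect of the 2-byte reserve is simple arithmetic, shown here as a user-space sketch (the helper name is invented): with a 16-byte-aligned buffer, the IP header that follows the 14-byte Ethernet header lands on a 16-byte boundary only when two bytes are reserved first.

```c
#include <assert.h>
#include <stdint.h>

enum { DEMO_ETH_HLEN = 14 };  /* Ethernet header length */

/* Address where the IP header would start, given the buffer start
 * and the number of bytes reserved before the Ethernet header. */
static uintptr_t demo_ip_header_addr(uintptr_t buf, unsigned int reserve)
{
    return buf + reserve + DEMO_ETH_HLEN;
}
```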

unsigned char *skb_pull(struct sk_buff *skb, int len);
unsigned char *skb_pull(struct sk_buff *skb, int len);

从数据包头部删除数据。驱动程序不需要使用此函数,但为了完整性起见将其包含在此处。它会递减skb->len和递增skb->data;这就是从传入数据包的开头剥离硬件标头(以太网或等效标头)的方式。

Removes data from the head of the packet. The driver won't need to use this function, but it is included here for completeness. It decrements skb->len and increments skb->data; this is how the hardware header (Ethernet or equivalent) is stripped from the beginning of incoming packets.

int skb_is_nonlinear(struct sk_buff *skb);
int skb_is_nonlinear(struct sk_buff *skb);

如果此 skb 被分成多个片段以进行分散/聚集 I/O,则返回真值。

Returns a true value if this skb is separated into multiple fragments for scatter/gather I/O.

int skb_headlen(struct sk_buff *skb);
int skb_headlen(struct sk_buff *skb);

返回 skb 第一个段的长度(由 指向的部分 skb->data)。

Returns the length of the first segment of the skb (that part pointed to by skb->data).

void *kmap_skb_frag(skb_frag_t *frag);

void kunmap_skb_frag(void *vaddr);
void *kmap_skb_frag(skb_frag_t *frag);

void kunmap_skb_frag(void *vaddr);

如果您必须从内核内部直接访问非线性 skb 中的片段,这些函数会为您映射和取消映射它们。使用原子 kmap,因此一次不能映射多个片段。

If you must directly access fragments in a nonlinear skb from within the kernel, these functions map and unmap them for you. An atomic kmap is used, so you cannot have more than one fragment mapped at a time.

内核定义了几个作用于套接字缓冲区的其他函数,但它们旨在用于更高层 网络代码,驱动程序不需要它们。

The kernel defines several other functions that act on socket buffers, but they are meant to be used in higher layers of networking code, and the driver doesn't need them.

MAC地址解析

MAC Address Resolution

以太网通信的一个有趣问题是如何将 MAC 地址(接口的唯一硬件 ID)与 IP 号关联起来。大多数协议都有类似的问题,但我们在这里关注类似以太网的情况。我们试图提供问题的完整描述,因此我们展示了三种情况:ARP、没有 ARP 的以太网标头(例如plip)和非以太网标头。

An interesting issue with Ethernet communication is how to associate the MAC addresses (the interface's unique hardware ID) with the IP number. Most protocols have a similar problem, but we concentrate on the Ethernet-like case here. We try to offer a complete description of the issue, so we show three situations: ARP, Ethernet headers without ARP (such as plip), and non-Ethernet headers.

在以太网中使用 ARP

Using ARP with Ethernet

处理地址解析的常用方法是使用地址解析协议 (ARP)。幸运的是,ARP 由内核管理,以太网接口不需要做任何特殊的事情来支持 ARP。只要在打开时正确分配dev->addrdev->addr_len,驱动程序就无需担心将 IP 号解析为 MAC 地址;ether_setup将正确的设备方法分配给dev->hard_headerdev->rebuild_header

The usual way to deal with address resolution is by using the Address Resolution Protocol (ARP). Fortunately, ARP is managed by the kernel, and an Ethernet interface doesn't need to do anything special to support ARP. As long as dev->addr and dev->addr_len are correctly assigned at open time, the driver doesn't need to worry about resolving IP numbers to MAC addresses; ether_setup assigns the correct device methods to dev->hard_header and dev->rebuild_header.

尽管内核通常处理地址解析的细节(以及结果的缓存),但它会调用接口驱动程序来帮助构建数据包。毕竟,驱动程序了解物理层标头的详细信息,而网络代码的作者则试图将内核的其余部分与这些知识隔离开来。为此,内核调用驱动程序的hard_header 方法用 ARP 查询的结果来布置数据包。通常,以太网驱动程序编写者不需要了解此过程 - 通用以太网代码会处理所有事情。

Although the kernel normally handles the details of address resolution (and caching of the results), it calls upon the interface driver to help in the building of the packet. After all, the driver knows about the details of the physical layer header, while the authors of the networking code have tried to insulate the rest of the kernel from that knowledge. To this end, the kernel calls the driver's hard_header method to lay out the packet with the results of the ARP query. Normally, Ethernet driver writers need not know about this process—the common Ethernet code takes care of everything.

覆盖 ARP

Overriding ARP

简单的点对点网络接口(例如 plip)可能会受益于使用以太网标头,同时避免来回发送 ARP 数据包的开销。snull中的示例代码 也属于此类网络设备。 snull无法使用 ARP,因为驱动程序会更改正在传输的数据包中的 IP 地址,并且 ARP 数据包也会交换 IP 地址。虽然我们可以轻松地实现一个简单的 ARP 回复生成器,但展示如何直接处理物理层标头更具说明性。

Simple point-to-point network interfaces, such as plip, might benefit from using Ethernet headers, while avoiding the overhead of sending ARP packets back and forth. The sample code in snull also falls into this class of network devices. snull cannot use ARP because the driver changes IP addresses in packets being transmitted, and ARP packets exchange IP addresses as well. Although we could have implemented a simple ARP reply generator with little trouble, it is more illustrative to show how to handle physical-layer headers directly.

如果您的设备想要使用常用的硬件标头而不运行 ARP,则需要覆盖默认的dev->hard_header方法。这就是 snull 的实现方式,作为一个非常短的函数:

If your device wants to use the usual hardware header without running ARP, you need to override the default dev->hard_header method. This is how snull implements it, as a very short function:

int snull_header(struct sk_buff *skb, struct net_device *dev,
                unsigned short type, void *daddr, void *saddr,
                unsigned int len)
{
    struct ethhdr *eth = (struct ethhdr *)skb_push(skb,ETH_HLEN);

    eth->h_proto = htons(type);
    memcpy(eth->h_source, saddr ? saddr : dev->dev_addr, dev->addr_len);
    memcpy(eth->h_dest,   daddr ? daddr : dev->dev_addr, dev->addr_len);
    eth->h_dest[ETH_ALEN-1]   ^= 0x01;   /* dest is us xor 1 */
    return (dev->hard_header_len);
}

该函数只是获取内核提供的信息并将其格式化为标准以太网标头。由于稍后描述的原因,它还会切换目标以太网地址中的一位。

The function simply takes the information provided by the kernel and formats it into a standard Ethernet header. It also toggles a bit in the destination Ethernet address, for reasons described later.

当接口接收到数据包时, eth_type_trans以多种方式使用硬件标头。我们已经在 snull_rx中看到了这个调用:

When a packet is received by the interface, the hardware header is used in a couple of ways by eth_type_trans. We have already seen this call in snull_rx:

skb->protocol = eth_type_trans(skb, dev);

该函数从以太网标头中提取协议标识符(在本例中为 ETH_P_IP);它还分配 skb->mac.raw,从数据包数据中删除硬件标头(使用 skb_pull),并设置 skb->pkt_type。最后一项在 skb 分配时默认为 PACKET_HOST(表示数据包定向到该主机),而 eth_type_trans 会更改它以反映以太网目标地址:如果该地址与接收它的接口的地址不匹配,则将 pkt_type 字段设置为 PACKET_OTHERHOST。随后,除非接口处于混杂模式或在内核中启用了数据包转发,否则 netif_rx 会丢弃任何 PACKET_OTHERHOST 类型的数据包。因此,snull_header 会小心地使目标硬件地址与“接收”接口的硬件地址相匹配。

The function extracts the protocol identifier (ETH_P_IP, in this case) from the Ethernet header; it also assigns skb->mac.raw, removes the hardware header from packet data (with skb_pull), and sets skb->pkt_type. This last item defaults to PACKET_HOST at skb allocation (which indicates that the packet is directed to this host), and eth_type_trans changes it to reflect the Ethernet destination address: if that address does not match the address of the interface that received it, the pkt_type field is set to PACKET_OTHERHOST. Subsequently, unless the interface is in promiscuous mode or packet forwarding is enabled in the kernel, netif_rx drops any packet of type PACKET_OTHERHOST. For this reason, snull_header is careful to make the destination hardware address match that of the "receiving" interface.

如果您的接口是点对点链路,您将不希望收到意外的多播数据包。为了避免此问题,请记住,第一个八位字节的最低有效位 (LSB) 为 0 的目标地址定向到单个主机(即,它要么是 PACKET_HOST,要么是 PACKET_OTHERHOST)。plip 驱动程序使用 0xfc 作为其硬件地址的第一个八位字节,而 snull 使用 0x00。这两个地址都会产生可工作的、类似以太网的点对点链路。

If your interface is a point-to-point link, you won't want to receive unexpected multicast packets. To avoid this problem, remember that a destination address whose first octet has 0 as the least significant bit (LSB) is directed to a single host (i.e., it is either PACKET_HOST or PACKET_OTHERHOST). The plip driver uses 0xfc as the first octet of its hardware address, while snull uses 0x00. Both addresses result in a working Ethernet-like point-to-point link.
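The unicast rule above reduces to a one-line check on the first octet. The helper name below is ours, not a kernel API:

```c
#include <assert.h>

/* An Ethernet destination address is unicast (directed to a single
 * host) when the least significant bit of its first octet is 0;
 * multicast and broadcast addresses have that bit set. */
static int demo_addr_is_unicast(const unsigned char *addr)
{
    return (addr[0] & 0x01) == 0;
}
```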

非以太网标头

Non-Ethernet Headers

刚才我们看到硬件头除了目的地址之外还包含一些信息,其中最重要的是通信协议。我们现在描述如何使用硬件标头来封装相关信息。如果您需要了解详细信息,可以从内核源代码或特定传输介质的技术文档中提取它们。大多数驱动程序编写者都可以忽略此讨论并仅使用以太网实现。

We have just seen that the hardware header contains some information in addition to the destination address, the most important being the communication protocol. We now describe how hardware headers can be used to encapsulate relevant information. If you need to know the details, you can extract them from the kernel sources or the technical documentation for the particular transmission medium. Most driver writers are able to ignore this discussion and just use the Ethernet implementation.

值得注意的是,并非每个协议都必须提供所有信息。诸如 plip 或 snull 之类的点对点链路可以避免传输整个以太网报头而不失通用性。前面展示的由 snull_header 实现的 hard_header 设备方法从内核接收传递信息(协议级地址和硬件地址)。它还在 type 参数中接收 16 位协议号;例如,IP 的标识是 ETH_P_IP。驱动程序应将数据包数据和协议号正确地传送到接收主机。点对点链路可以在其硬件标头中省略地址,仅传输协议号,因为无论源地址和目标地址如何,传递都是有保证的。仅支持 IP 的链路甚至可以完全不传输任何硬件标头。

It's worth noting that not all information has to be provided by every protocol. A point-to-point link such as plip or snull could avoid transferring the whole Ethernet header without losing generality. The hard_header device method, shown earlier as implemented by snull_header, receives the delivery information—both protocol-level and hardware addresses—from the kernel. It also receives the 16-bit protocol number in the type argument; IP, for example, is identified by ETH_P_IP. The driver is expected to correctly deliver both the packet data and the protocol number to the receiving host. A point-to-point link could omit addresses from its hardware header, transferring only the protocol number, because delivery is guaranteed independent of the source and destination addresses. An IP-only link could even avoid transmitting any hardware header whatsoever.

当数据包在链路的另一端被拾取时,驱动程序中的接收函数应正确设置 skb->protocol、skb->pkt_type 和 skb->mac.raw 字段。

When the packet is picked up at the other end of the link, the receiving function in the driver should correctly set the fields skb->protocol, skb->pkt_type, and skb->mac.raw.

skb->mac.raw 是一个 char 指针,供网络代码较高层(例如 net/ipv4/arp.c)中实现的地址解析机制使用。它必须指向与 dev->type 相匹配的机器地址。设备类型的可能值在 <linux/if_arp.h> 中定义;以太网接口使用 ARPHRD_ETHER。例如,以下是 eth_type_trans 如何处理接收到的数据包的以太网标头:

skb->mac.raw is a char pointer used by the address-resolution mechanism implemented in higher layers of the networking code (for instance, net/ipv4/arp.c). It must point to a machine address that matches dev->type. The possible values for the device type are defined in <linux/if_arp.h>; Ethernet interfaces use ARPHRD_ETHER. For example, here is how eth_type_trans deals with the Ethernet header for received packets:

skb->mac.raw = skb->data;
skb_pull(skb, dev->hard_header_len);

在最简单的情况下(没有标头的点对点链路),skb->mac.raw 可以指向包含此接口硬件地址的静态缓冲区,protocol 可以设置为 ETH_P_IP,而 packet_type 可以保留其默认值 PACKET_HOST。

In the simplest case (a point-to-point link with no headers), skb->mac.raw can point to a static buffer containing the hardware address of this interface, protocol can be set to ETH_P_IP, and packet_type can be left with its default value of PACKET_HOST.

因为每种硬件类型都是独一无二的,很难给出比前面讨论过的更具体的建议。不过,内核中充满了示例。例如,请参阅 AppleTalk 驱动程序(drivers/net/appletalk/cops.c)、红外驱动程序(例如 drivers/net/irda/smc_ircc.c)或 PPP 驱动程序(drivers/net/ppp_generic.c)。

Because every hardware type is unique, it is hard to give more specific advice than already discussed. The kernel is full of examples, however. See, for example, the AppleTalk driver (drivers/net/appletalk/cops.c), the infrared drivers (such as drivers/net/irda/smc_ircc.c), or the PPP driver (drivers/net/ppp_generic.c).

自定义 ioctl 命令

Custom ioctl Commands

我们已经看到ioctl系统调用是针对socket实现的;SIOCSIFADDRSIOCSIFMAP是“socket ioctl ”的示例。现在让我们看看网络代码如何使用系统调用的第三个参数。

We have seen that the ioctl system call is implemented for sockets; SIOCSIFADDR and SIOCSIFMAP are examples of "socket ioctls." Now let's see how the third argument of the system call is used by networking code.

当在套接字上调用 ioctl 系统调用时,命令号是 <linux/sockios.h> 中定义的符号之一,并且 sock_ioctl 函数会直接调用特定于协议的函数(其中"协议"指所使用的主网络协议,例如 IP 或 AppleTalk)。

When the ioctl system call is invoked on a socket, the command number is one of the symbols defined in <linux/sockios.h>, and the sock_ioctl function directly invokes a protocol-specific function (where "protocol" refers to the main network protocol being used, for example, IP or AppleTalk).

任何协议层无法识别的 ioctl 命令都会传递到设备层。这些与设备相关的 ioctl 命令接受来自用户空间的第三个参数,即 struct ifreq *。该结构在 <linux/if.h> 中定义。SIOCSIFADDR 和 SIOCSIFMAP 命令实际上作用于 ifreq 结构。SIOCSIFMAP 的额外参数虽然定义为 ifmap,但只是 ifreq 的一个字段。

Any ioctl command that is not recognized by the protocol layer is passed to the device layer. These device-related ioctl commands accept a third argument from user space, a struct ifreq *. This structure is defined in <linux/if.h>. The SIOCSIFADDR and SIOCSIFMAP commands actually work on the ifreq structure. The extra argument to SIOCSIFMAP, although defined as ifmap, is just a field of ifreq.

除了使用标准化调用之外,每个接口还可以定义自己的 ioctl 命令。例如,plip 接口允许通过 ioctl 修改其内部超时值。套接字的 ioctl 实现将 16 个命令识别为接口私有命令:SIOCDEVPRIVATE 到 SIOCDEVPRIVATE+15。[2]

In addition to using the standardized calls, each interface can define its own ioctl commands. The plip interface, for example, allows the interface to modify its internal timeout values via ioctl. The ioctl implementation for sockets recognizes 16 commands as private to the interface: SIOCDEVPRIVATE through SIOCDEVPRIVATE+15.[2]

当这些命令之一被识别时,将在相关接口驱动程序中调用 dev->do_ioctl。该函数接收通用 ioctl 函数所使用的同一个 struct ifreq * 指针:

When one of these commands is recognized, dev->do_ioctl is called in the relevant interface driver. The function receives the same struct ifreq * pointer that the general-purpose ioctl function uses:

int (*do_ioctl)(struct net_device *dev, struct ifreq *ifr, int cmd);
int (*do_ioctl)(struct net_device *dev, struct ifreq *ifr, int cmd);

ifr指针指向内核空间地址,该地址保存用户传递的结构的副本。do_ioctl返回后 ,该结构体被复制回用户空间;因此,驱动程序可以使用私有命令来接收和返回数据。

The ifr pointer points to a kernel-space address that holds a copy of the structure passed by the user. After do_ioctl returns, the structure is copied back to user space; therefore, the driver can use the private commands to both receive and return data.

设备特定的命令可以选择使用 struct ifreq 中的字段,但这些字段已经具有标准化的含义,驱动程序不太可能使该结构适应自己的需要。ifr_data 字段是一个 caddr_t 项(指针),专门用于设备特定的需求。驱动程序和用于调用其 ioctl 命令的程序应该就 ifr_data 的使用达成一致。例如,pppstats 使用设备特定的命令从 ppp 接口驱动程序检索信息。

The device-specific commands can choose to use the fields in struct ifreq, but they already convey a standardized meaning, and it's unlikely that the driver can adapt the structure to its needs. The field ifr_data is a caddr_t item (a pointer) that is meant to be used for device-specific needs. The driver and the program used to invoke its ioctl commands should agree about the use of ifr_data. For example, pppstats uses device-specific commands to retrieve information from the ppp interface driver.

这里不值得展示 do_ioctl 的实现,但有了本章中的信息和内核示例,您应该能够在需要时编写一个。但请注意,plip 实现对 ifr_data 的使用不正确,不应用作 ioctl 实现的示例。

It's not worth showing an implementation of do_ioctl here, but with the information in this chapter and the kernel examples, you should be able to write one when you need it. Note, however, that the plip implementation uses ifr_data incorrectly and should not be used as an example for an ioctl implementation.

统计信息

Statistical Information

驱动程序需要的最后一个方法是 get_stats。此方法返回指向设备统计信息的指针。它的实现非常简单;即使多个接口由同一驱动程序管理,所示的接口也能工作,因为统计信息托管在设备数据结构中。

The last method a driver needs is get_stats. This method returns a pointer to the statistics for the device. Its implementation is pretty easy; the one shown works even when several interfaces are managed by the same driver, because the statistics are hosted within the device data structure.

struct net_device_stats *snull_stats(struct net_device *dev)
{
    struct snull_priv *priv = netdev_priv(dev);
    return &priv->stats;
}

返回有意义的统计数据所需的实际工作分布在整个驱动程序中,各个字段在驱动程序各处被更新。以下列表显示了 struct net_device_stats 中最有趣的字段:

The real work needed to return meaningful statistics is distributed throughout the driver, where the various fields are updated. The following list shows the most interesting fields in struct net_device_stats:

unsigned long rx_packets;

unsigned long tx_packets;
unsigned long rx_packets;

unsigned long tx_packets;

接口成功传输的传入和传出数据包的总数。

The total number of incoming and outgoing packets successfully transferred by the interface.

unsigned long rx_bytes;

unsigned long tx_bytes;
unsigned long rx_bytes;

unsigned long tx_bytes;

接口接收和发送的字节数。

The number of bytes received and transmitted by the interface.

unsigned long rx_errors;

unsigned long tx_errors;
unsigned long rx_errors;

unsigned long tx_errors;

接收错误和传输错误的数量。数据包传输中可能出错的地方数不胜数,net_device_stats 结构包括六个用于特定接收错误的计数器和五个用于传输错误的计数器。完整列表请参见 <linux/netdevice.h>。如果可能,您的驱动程序应维护详细的错误统计信息,因为它们对试图追查问题的系统管理员最有帮助。

The number of erroneous receptions and transmissions. There's no end of things that can go wrong with packet transmission, and the net_device_stats structure includes six counters for specific receive errors and five for transmit errors. See <linux/netdevice.h> for the full list. If possible, your driver should maintain detailed error statistics, because they can be most helpful to system administrators trying to track down a problem.

unsigned long rx_dropped;

unsigned long tx_dropped;
unsigned long rx_dropped;

unsigned long tx_dropped;

接收和发送期间丢弃的数据包数量。当没有可用于数据包数据的内存时,数据包就会被丢弃。tx_dropped很少使用。

The number of packets dropped during reception and transmission. Packets are dropped when there's no memory available for packet data. tx_dropped is rarely used.

unsigned long collisions;
unsigned long collisions;

由于介质拥塞而导致的冲突数。

The number of collisions due to congestion on the medium.

unsigned long multicast;
unsigned long multicast;

收到的多播数据包的数量。

The number of multicast packets received.

值得重申的是,即使接口关闭,get_stats 方法也可能随时被调用,因此只要 net_device 结构存在,驱动程序就必须保留统计信息。

It is worth repeating that the get_stats method can be called at any time—even when the interface is down—so the driver must retain statistical information for as long as the net_device structure exists.

组播

Multicast

多播数据包是一种由多个主机(但并非所有主机)接收的网络数据包。此功能是通过向主机组分配特殊硬件地址来实现的。发送到这些特殊地址之一的数据包应被该组中的所有主机接收。就以太网而言,多播地址在目标地址第一个八位字节中设置了最低有效位,而每个设备板自己的硬件地址中该位是清零的。

A multicast packet is a network packet meant to be received by more than one host, but not by all hosts. This functionality is obtained by assigning special hardware addresses to groups of hosts. Packets directed to one of the special addresses should be received by all the hosts in that group. In the case of Ethernet, a multicast address has the least significant bit of the first address octet set in the destination address, while every device board has that bit clear in its own hardware address.

处理主机组和硬件地址的棘手部分是由应用程序和内核执行的,接口驱动程序不需要处理这些问题。

The tricky part of dealing with host groups and hardware addresses is performed by applications and the kernel, and the interface driver doesn't need to deal with these problems.

多播数据包的传输是一个简单的问题,因为它们看起来与任何其他数据包完全相同。接口通过通信介质传输它们,而不查看目标地址。分配正确的硬件目标地址是内核的工作;hard_header 设备方法(如果已定义)不需要查看它所组装的数据。

Transmission of multicast packets is a simple problem because they look exactly like any other packets. The interface transmits them over the communication medium without looking at the destination address. It's the kernel that has to assign a correct hardware destination address; the hard_header device method, if defined, doesn't need to look in the data it arranges.

内核处理在任何给定时间跟踪感兴趣的多播地址的工作。该列表可能会经常更改,因为它是在任何给定时间运行的应用程序和用户兴趣的函数。驱动程序的工作是接受感兴趣的多播地址列表并将发送到这些地址的任何数据包传送到内核。驱动程序如何实现多播列表在一定程度上取决于底层硬件的工作方式。通常,就多播而言,硬件属于三类之一:

The kernel handles the job of tracking which multicast addresses are of interest at any given time. The list can change frequently, since it is a function of the applications that are running at any given time and the users' interest. It is the driver's job to accept the list of interesting multicast addresses and deliver to the kernel any packets sent to those addresses. How the driver implements the multicast list is somewhat dependent on how the underlying hardware works. Typically, hardware belongs to one of three classes, as far as multicast is concerned:

  • 无法处理多播的接口。这些接口要么只接收专门定向到其硬件地址的数据包(加上广播数据包),要么接收每个数据包。它们只能通过接收每个数据包来接收多播数据包,因此可能会用大量"无趣"的数据包压垮操作系统。您通常不会将这些接口视为具有多播功能,驱动程序也不会在 dev->flags 中设置 IFF_MULTICAST。

  • Interfaces that cannot deal with multicast. These interfaces either receive packets directed specifically to their hardware address (plus broadcast packets) or receive every packet. They can receive multicast packets only by receiving every packet, thus, potentially overwhelming the operating system with a huge number of "uninteresting" packets. You don't usually count these interfaces as multicast capable, and the driver won't set IFF_MULTICAST in dev->flags.

  • 点对点接口是一种特殊情况,因为它们总是接收每个数据包而不执行任何硬件过滤。

  • Point-to-point interfaces are a special case because they always receive every packet without performing any hardware filtering.

  • 可以区分多播数据包和其他数据包(主机到主机或广播)的接口。可以指示这些接口接收每个多播数据包,并让软件确定该主机是否对该地址感兴趣。在这种情况下引入的开销是可以接受的,因为典型网络上的多播数据包的数量很少。

  • Interfaces that can tell multicast packets from other packets (host-to-host or broadcast). These interfaces can be instructed to receive every multicast packet and let the software determine if the address is interesting for this host. The overhead introduced in this case is acceptable, because the number of multicast packets on a typical network is low.

  • 可以对组播地址进行硬件检测的接口。可以向这些接口传递要接收数据包的多播地址列表,并忽略其他多播数据包。这是内核的最佳情况,因为它不会浪费处理器时间来丢弃接口接收到的“无趣”数据包。

  • Interfaces that can perform hardware detection of multicast addresses. These interfaces can be passed a list of multicast addresses for which packets are to be received, and ignore other multicast packets. This is the optimal case for the kernel, because it doesn't waste processor time dropping "uninteresting" packets received by the interface.

内核试图充分利用高级接口的能力,按第三类设备(最通用的一类)来提供最佳支持。因此,每当有效多播地址列表发生更改时,内核都会通知驱动程序,并将新列表传递给驱动程序,以便驱动程序可以根据新信息更新硬件过滤器。

The kernel tries to exploit the capabilities of high-level interfaces by supporting the third device class, which is the most versatile, at its best. Therefore, the kernel notifies the driver whenever the list of valid multicast addresses is changed, and it passes the new list to the driver so it can update the hardware filter according to the new information.

内核对多播的支持

Kernel Support for Multicasting

支持组播数据包由几个项目组成:设备方法、数据结构和设备标志:

Support for multicast packets is made up of several items: a device method, a data structure, and device flags:

void (*dev->set_multicast_list)(struct net_device *dev);
void (*dev->set_multicast_list)(struct net_device *dev);

每当与设备关联的机器地址列表发生更改时,就会调用该设备方法。修改 dev->flags 时也会调用它,因为某些标志(例如 IFF_PROMISC)可能也需要您重新编程硬件过滤器。该方法接收指向 struct net_device 的指针作为参数并返回 void。对实现此方法不感兴趣的驱动程序可以将该字段保留为 NULL。

Device method called whenever the list of machine addresses associated with the device changes. It is also called when dev->flags is modified, because some flags (e.g., IFF_PROMISC) may also require you to reprogram the hardware filter. The method receives a pointer to struct net_device as an argument and returns void. A driver not interested in implementing this method can leave the field set to NULL.

struct dev_mc_list *dev->mc_list;
struct dev_mc_list *dev->mc_list;

与设备关联的所有多播地址的链表。该结构的实际定义在本节末尾介绍。

A linked list of all the multicast addresses associated with the device. The actual definition of the structure is introduced at the end of this section.

int dev->mc_count;
int dev->mc_count;

链表中的项目数。此信息有些多余,但将 mc_count 与 0 进行比较是检查该列表是否为空的有用捷径。

The number of items in the linked list. This information is somewhat redundant, but checking mc_count against 0 is a useful shortcut for checking the list.

IFF_MULTICAST
IFF_MULTICAST

除非驱动程序在 dev->flags 中设置此标志,否则不会要求接口处理多播数据包。尽管如此,当 dev->flags 更改时,内核仍会调用驱动程序的 set_multicast_list 方法,因为多播列表可能在接口未激活时已发生更改。

Unless the driver sets this flag in dev->flags, the interface won't be asked to handle multicast packets. Nonetheless, the kernel calls the driver's set_multicast_list method when dev->flags changes, because the multicast list may have changed while the interface was not active.

IFF_ALLMULTI
IFF_ALLMULTI

由网络软件在 dev->flags 中设置的标志,告诉驱动程序从网络接收所有多播数据包。当启用多播路由时会出现这种情况。如果设置了该标志,则不应使用 dev->mc_list 来过滤多播数据包。

Flag set in dev->flags by the networking software to tell the driver to retrieve all multicast packets from the network. This happens when multicast routing is enabled. If the flag is set, dev->mc_list shouldn't be used to filter multicast packets.

IFF_PROMISC
IFF_PROMISC

当接口进入混杂模式时在 dev->flags 中设置的标志。接口应接收每个数据包,而不管 dev->mc_list 的内容如何。

Flag set in dev->flags when the interface is put into promiscuous mode. Every packet should be received by the interface, independent of dev->mc_list.

驱动程序开发人员需要的最后一点信息是 struct dev_mc_list 的定义,它位于 <linux/netdevice.h> 中:

The last bit of information needed by the driver developer is the definition of struct dev_mc_list, which lives in <linux/netdevice.h>:

struct dev_mc_list {    
    struct dev_mc_list   *next;          /* Next address in list */
    _ _u8                 dmi_addr[MAX_ADDR_LEN]; /* Hardware address */
    unsigned char        dmi_addrlen;    /* Address length */
    int                  dmi_users;      /* Number of users */
    int                  dmi_gusers;     /* Number of groups */
};

由于多播和硬件地址与数据包的实际传输无关,因此该结构可以在不同网络实现之间移植,并且每个地址都由一串八位字节和一个长度来标识,就像 dev->dev_addr 一样。

Because multicasting and hardware addresses are independent of the actual transmission of packets, this structure is portable across network implementations, and each address is identified by a string of octets and a length, just like dev->dev_addr.

典型实现

A Typical Implementation

描述 set_multicast_list 设计的最好方式是向您展示一些伪代码。

The best way to describe the design of set_multicast_list is to show you some pseudocode.

以下函数是该方法在全功能 (ff) 驱动程序中的典型实现。之所以说该驱动程序功能齐全,是因为它控制的接口具有复杂的硬件数据包过滤器,可以保存该主机要接收的多播地址表。表的最大尺寸为 FF_TABLE_SIZE。

The following function is a typical implementation of the function in a full-featured (ff) driver. The driver is full featured in that the interface it controls has a complex hardware packet filter, which can hold a table of multicast addresses to be received by this host. The maximum size of the table is FF_TABLE_SIZE.

所有以 ff_ 为前缀的函数都是特定于硬件的操作的占位符:

All the functions prefixed with ff_ are placeholders for hardware-specific operations:

void ff_set_multicast_list(struct net_device *dev)
{
    struct dev_mc_list *mc_ptr;

    if (dev->flags & IFF_PROMISC) {
        ff_get_all_packets();
        return;
    }
    /* If there's more addresses than we handle, get all multicast
    packets and sort them out in software. */
    if (dev->flags & IFF_ALLMULTI || dev->mc_count > FF_TABLE_SIZE) {
        ff_get_all_multicast_packets();
        return;
    }
    /* No multicast?  Just get our own stuff */
    if (dev->mc_count == 0) {
        ff_get_only_own_packets();
        return;
    }
    /* Store all of the multicast addresses in the hardware filter */
    ff_clear_mc_list();
    for (mc_ptr = dev->mc_list; mc_ptr; mc_ptr = mc_ptr->next)
        ff_store_mc_address(mc_ptr->dmi_addr);
    ff_get_packets_in_multicast_list();
}

如果接口无法在硬件过滤器中存储传入数据包的多播表,则可以简化此实现。在这种情况下,FF_TABLE_SIZE减少为0,并且不需要最后四行代码。

This implementation can be simplified if the interface cannot store a multicast table in the hardware filter for incoming packets. In that case, FF_TABLE_SIZE reduces to 0, and the last four lines of code are not needed.

如前所述,即使是不能处理多播数据包的接口,也需要实现 set_multicast_list 方法,以便在 dev->flags 发生变化时得到通知。这种方式可以称为"无特性"(nonfeatured,nf)实现。其实现非常简单,如下代码所示:

As was mentioned earlier, even interfaces that can't deal with multicast packets need to implement the set_multicast_list method to be notified about changes in dev->flags. This approach could be called a "nonfeatured" (nf) implementation. The implementation is very simple, as shown by the following code:

void nf_set_multicast_list(struct net_device *dev)
{
    if (dev->flags & IFF_PROMISC)
        nf_get_all_packets();
    else
        nf_get_only_own_packets();
}

实现 IFF_PROMISC 很重要,否则用户将无法运行 tcpdump 或任何其他网络分析器。另一方面,如果接口运行的是点对点链路,则根本不需要实现 set_multicast_list,因为用户无论如何都会收到每个数据包。

Implementing IFF_PROMISC is important, because otherwise the user won't be able to run tcpdump or any other network analyzers. If the interface runs a point-to-point link, on the other hand, there's no need to implement set_multicast_list at all, because users receive every packet anyway.

其他一些细节

A Few Other Details

本节涵盖网络驱动程序作者可能感兴趣的其他一些主题。在每种情况下,我们只是尝试为您指明正确的方向。要全面了解该主题可能还需要花费一些时间挖掘内核源代码。

This section covers a few other topics that may be of interest to network driver authors. In each case, we simply try to point you in the right direction. Obtaining a complete picture of the subject probably requires spending some time digging through the kernel source as well.

媒体独立接口支持

Media Independent Interface Support

媒体独立接口(或 MII)是一种 IEEE 802.3 标准,描述以太网收发器如何与网络控制器连接;市场上很多产品都符合这个接口。如果您正在为符合 MII 的控制器编写驱动程序,内核会导出一个通用的 MII 支持层,这可能会让您的生活更轻松。

Media Independent Interface (or MII) is an IEEE 802.3 standard describing how Ethernet transceivers can interface with network controllers; many products on the market conform with this interface. If you are writing a driver for an MII-compliant controller, the kernel exports a generic MII support layer that may make your life easier.

要使用通用 MII 层,您应该包含 <linux/mii.h>。您需要填写一个 mii_if_info 结构,其中包含收发器的物理 ID、全双工是否生效等信息。该结构还需要两个方法:

To use the generic MII layer, you should include <linux/mii.h>. You need to fill out an mii_if_info structure with information on the physical ID of the transceiver, whether full duplex is in effect, etc. Also required are two methods for the mii_if_info structure:

int (*mdio_read) (struct net_device *dev, int phy_id, int location);
void (*mdio_write) (struct net_device *dev, int phy_id, int location, int val);
int (*mdio_read) (struct net_device *dev, int phy_id, int location);
void (*mdio_write) (struct net_device *dev, int phy_id, int location, int val);

正如您所期望的,这些方法应该实现与您的特定 MII 接口的通信。

As you might expect, these methods should implement communications with your specific MII interface.

通用MII代码提供了一组用于查询和更改收发器操作模式的函数;其中许多旨在与 ethtool实用程序一起使用(在下一节中介绍)。有关详细信息,请参阅 <linux/mii.h>drivers/net/mii.c 。

The generic MII code provides a set of functions for querying and changing the operating mode of the transceiver; many of these are designed to work with the ethtool utility (described in the next section). Look in <linux/mii.h> and drivers/net/mii.c for the details.

ethtool 支持

Ethtool Support

ethtool 是一个实用程序,旨在为系统管理员提供对网络接口操作的大量控制。使用 ethtool,可以控制各种接口参数,包括速度、媒体类型、双工操作、DMA 环设置、硬件校验和、LAN 唤醒操作等,但前提是驱动程序支持 ethtool。ethtool 可以从 http://sf.net/projects/gkernel/ 下载。

Ethtool is a utility designed to give system administrators a great deal of control over the operation of network interfaces. With ethtool, it is possible to control various interface parameters including speed, media type, duplex operation, DMA ring setup, hardware checksumming, wake-on-LAN operation, etc., but only if ethtool is supported by the driver. Ethtool may be downloaded from http://sf.net/projects/gkernel/.

ethtool 支持的相关声明可以在 <linux/ethtool.h> 中找到。其核心是一个 ethtool_ops 类型的结构,其中包含用于 ethtool 支持的整整 24 个不同方法。这些方法大多相对简单;有关详细信息,请参阅 <linux/ethtool.h>。如果您的驱动程序使用 MII 层,则可以分别使用 mii_ethtool_gset 和 mii_ethtool_sset 来实现 get_settings 和 set_settings 方法。

The relevant declarations for ethtool support may be found in <linux/ethtool.h>. At the core of it is a structure of type ethtool_ops, which contains a full 24 different methods for ethtool support. Most of these methods are relatively straightforward; see <linux/ethtool.h> for the details. If your driver is using the MII layer, you can use mii_ethtool_gset and mii_ethtool_sset to implement the get_settings and set_settings methods, respectively.

为了使 ethtool 能够与您的设备配合使用,您必须在 net_device 结构中放置一个指向您的 ethtool_ops 结构的指针。应使用宏 SET_ETHTOOL_OPS(在 <linux/netdevice.h> 中定义)来实现此目的。请注意,即使接口关闭,您的 ethtool 方法也可能被调用。

For ethtool to work with your device, you must place a pointer to your ethtool_ops structure in the net_device structure. The macro SET_ETHTOOL_OPS (defined in <linux/netdevice.h>) should be used for this purpose. Do note that your ethtool methods can be called even when the interface is down.

Netpoll

Netpoll

"Netpoll" 是网络堆栈中相对较晚(2.6.5)加入的功能;其目的是使内核能够在完整的网络和 I/O 子系统可能不可用的情况下发送和接收数据包。它用于远程网络控制台和远程内核调试等特性。在驱动程序中支持 netpoll 绝非必要,但它可能使您的设备在某些情况下更有用。在大多数情况下,支持 netpoll 也相对容易。

"Netpoll" is a relatively late (2.6.5) addition to the network stack; its purpose is to enable the kernel to send and receive packets in situations where the full network and I/O subsystems may not be available. It is used for features like remote network consoles and remote kernel debugging. Supporting netpoll in your driver is not, by any means, necessary, but it may make your device more useful in some situations. Supporting netpoll is also relatively easy in most cases.

实现 netpoll 的驱动程序应该实现poll_controller 方法。它的工作是在没有设备中断的情况下跟上控制器上可能发生的任何事情。几乎所有poll_controller方法都采用以下形式:

Drivers implementing netpoll should implement the poll_controller method. Its job is to keep up with anything that may be happening on the controller in the absence of device interrupts. Almost all poll_controller methods take the following form:

void my_poll_controller(struct net_device *dev)
{
    disable_device_interrupts(dev);
    call_interrupt_handler(dev->irq, dev, NULL);
    reenable_device_interrupts(dev);
}

poll_controller 方法本质上只是模拟来自给定设备的中断。

The poll_controller method, in essence, is simply simulating interrupts from the given device.

快速参考

Quick Reference

本节提供本章所介绍概念的参考。它还解释了驱动程序需要包含的每个头文件的作用。不过,net_device 和 sk_buff 结构中的字段列表在此处不再重复。

This section provides a reference for the concepts introduced in this chapter. It also explains the role of each header file that a driver needs to include. The lists of fields in the net_device and sk_buff structures, however, are not repeated here.

#include <linux/netdevice.h>
#include <linux/netdevice.h>

包含 struct net_device 和 struct net_device_stats 定义的头文件,同时包含网络驱动程序所需的其他一些头文件。

Header that hosts the definitions of struct net_device and struct net_device_stats, and includes a few other headers that are needed by network drivers.

struct net_device *alloc_netdev(int sizeof_priv, char *name, void

(*setup)(struct net_device *));

struct net_device *alloc_etherdev(int sizeof_priv);

void free_netdev(struct net_device *dev);
struct net_device *alloc_netdev(int sizeof_priv, char *name, void

(*setup)(struct net_device *));

struct net_device *alloc_etherdev(int sizeof_priv);

void free_netdev(struct net_device *dev);

用于分配和释放net_device 结构的函数。

Functions for allocating and freeing net_device structures.

int register_netdev(struct net_device *dev);

void unregister_netdev(struct net_device *dev);
int register_netdev(struct net_device *dev);

void unregister_netdev(struct net_device *dev);

注册和取消注册网络设备。

Registers and unregisters a network device.

void *netdev_priv(struct net_device *dev);
void *netdev_priv(struct net_device *dev);

检索指向网络设备结构的驱动程序私有区域的指针的函数。

A function that retrieves the pointer to the driver-private area of a network device structure.

struct net_device_stats;
struct net_device_stats;

保存设备统计信息的结构。

A structure that holds device statistics.

netif_start_queue(struct net_device *dev);

netif_stop_queue(struct net_device *dev);

netif_wake_queue(struct net_device *dev);
netif_start_queue(struct net_device *dev);

netif_stop_queue(struct net_device *dev);

netif_wake_queue(struct net_device *dev);

控制数据包传递到驱动程序进行传输的函数。在调用 netif_start_queue之前不会传输任何数据包。netif_stop_queue暂停传输, netif_wake_queue重新启动队列并戳网络层重新开始传输数据包。

Functions that control the passing of packets to the driver for transmission. No packets are transmitted until netif_start_queue has been called. netif_stop_queue suspends transmission, and netif_wake_queue restarts the queue and pokes the network layer to restart transmitting packets.

skb_shinfo(struct sk_buff *skb);
skb_shinfo(struct sk_buff *skb);

提供对数据包缓冲区的“共享信息”部分的访问的宏。

A macro that provides access to the "shared info" portion of a packet buffer.

void netif_rx(struct sk_buff *skb);
void netif_rx(struct sk_buff *skb);

可以调用(包括在中断时)的函数来通知内核已收到数据包并将其封装到套接字缓冲区中。

Function that can be called (including at interrupt time) to notify the kernel that a packet has been received and encapsulated into a socket buffer.

void netif_rx_schedule(dev);
void netif_rx_schedule(dev);

通知内核数据包已可用并且应在该接口上开始轮询的函数;它仅由符合 NAPI 的驱动程序使用。

Function that informs the kernel that packets are available and that polling should be started on the interface; it is used only by NAPI-compliant drivers.

int netif_receive_skb(struct sk_buff *skb);

void netif_rx_complete(struct net_device *dev);
int netif_receive_skb(struct sk_buff *skb);

void netif_rx_complete(struct net_device *dev);

仅应由符合 NAPI 的驱动程序使用的函数。netif_receive_skb 是 netif_rx 的 NAPI 等价物;它将一个数据包送入内核。当符合 NAPI 的驱动程序耗尽了接收到的数据包后,它应该重新启用中断,并调用 netif_rx_complete 来停止轮询。

Functions that should be used only by NAPI-compliant drivers. netif_receive_skb is the NAPI equivalent to netif_rx; it feeds a packet into the kernel. When a NAPI-compliant driver has exhausted the supply of received packets, it should reenable interrupts, and call netif_rx_complete to stop polling.

#include <linux/if.h>
#include <linux/if.h>

该文件由 netdevice.h 包含,声明了接口标志(IFF_ 宏)和 struct ifmap,后者在网络驱动程序的 ioctl 实现中发挥着重要作用。

Included by netdevice.h, this file declares the interface flags (IFF_ macros) and struct ifmap, which has a major role in the ioctl implementation for network drivers.

void netif_carrier_off(struct net_device *dev);

void netif_carrier_on(struct net_device *dev);

int netif_carrier_ok(struct net_device *dev);
void netif_carrier_off(struct net_device *dev);

void netif_carrier_on(struct net_device *dev);

int netif_carrier_ok(struct net_device *dev);

前两个函数可用于告诉内核给定接口上当前是否存在载波信号。netif_carrier_ok 测试设备结构中所反映的载波状态。

The first two functions may be used to tell the kernel whether a carrier signal is currently present on the given interface. netif_carrier_ok tests the carrier state as reflected in the device structure.

#include <linux/if_ether.h>

ETH_ALEN

ETH_P_IP

struct ethhdr;
#include <linux/if_ether.h>

ETH_ALEN

ETH_P_IP

struct ethhdr;

if_ether.h 由 netdevice.h 包含,定义了用于表示八位字节长度(例如地址长度)和网络协议(例如 IP)的所有 ETH_ 宏。它还定义了 ethhdr 结构。

Included by netdevice.h, if_ether.h defines all the ETH_ macros used to represent octet lengths (such as the address length) and network protocols (such as IP). It also defines the ethhdr structure.

#include <linux/skbuff.h>
#include <linux/skbuff.h>

struct sk_buff 及相关结构的定义,以及作用于这些缓冲区的几个内联函数。该头文件由 netdevice.h 包含。

The definition of struct sk_buff and related structures, as well as several inline functions to act on the buffers. This header is included by netdevice.h.

struct sk_buff *alloc_skb(unsigned int len, int priority);

struct sk_buff *dev_alloc_skb(unsigned int len);

void kfree_skb(struct sk_buff *skb);

void dev_kfree_skb(struct sk_buff *skb);

void dev_kfree_skb_irq(struct sk_buff *skb);

void dev_kfree_skb_any(struct sk_buff *skb);

Functions that handle the allocation and freeing of socket buffers. Drivers should normally use the dev_ variants, which are intended for that purpose.

unsigned char *skb_put(struct sk_buff *skb, int len);

unsigned char *__skb_put(struct sk_buff *skb, int len);

unsigned char *skb_push(struct sk_buff *skb, int len);

unsigned char *__skb_push(struct sk_buff *skb, int len);

Functions that add data to an skb; skb_put puts the data at the end of the skb, while skb_push puts it at the beginning. The regular versions perform checking to ensure that adequate space is available; double-underscore versions leave those tests out.

int skb_headroom(struct sk_buff *skb);

int skb_tailroom(struct sk_buff *skb);

void skb_reserve(struct sk_buff *skb, int len);

Functions that perform management of space within an skb. skb_headroom and skb_tailroom tell how much space is available at the beginning and end, respectively, of an skb. skb_reserve may be used to reserve space at the beginning of an skb, which must be empty.
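
Taken together, these functions implement the usual Ethernet receive pattern: allocate with a little room to spare, reserve two bytes so that the IP header following the 14-byte Ethernet header lands on a 16-byte boundary, then append the packet with skb_put. A sketch, where dev, len, and packet_data are assumed to come from the receiving driver's context:

```c
/* Sketch of a typical receive path. */
struct sk_buff *skb;

skb = dev_alloc_skb(len + 2);
if (!skb)
    return;                    /* out of memory: drop the packet */
skb_reserve(skb, 2);           /* align the IP header on a 16-byte boundary */
memcpy(skb_put(skb, len), packet_data, len);  /* copy data from the hardware */

skb->dev = dev;
skb->protocol = eth_type_trans(skb, dev);
netif_rx(skb);                 /* hand the buffer to the network layer */
```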

unsigned char *skb_pull(struct sk_buff *skb, int len);

skb_pull "removes" data from an skb by adjusting the internal pointers.

int skb_is_nonlinear(struct sk_buff *skb);

Function that returns a true value if this skb is separated into multiple fragments for scatter/gather I/O.

int skb_headlen(struct sk_buff *skb);

Returns the length of the first segment of the skb—that part pointed to by skb->data.

void *kmap_skb_frag(skb_frag_t *frag);

void kunmap_skb_frag(void *vaddr);

Functions that provide direct access to fragments within a nonlinear skb.

#include <linux/etherdevice.h>

void ether_setup(struct net_device *dev);

Function that sets most device methods to the general-purpose implementation for Ethernet drivers. It also sets dev->flags and assigns the next available ethx name to dev->name if the first character in the name is a blank space or the NULL character.

unsigned short eth_type_trans(struct sk_buff *skb, struct net_device *dev);

When an Ethernet interface receives a packet, this function can be called to set skb->pkt_type. The return value is a protocol number that is usually stored in skb->protocol.

#include <linux/sockios.h>

SIOCDEVPRIVATE

The first of 16 ioctl commands that can be implemented by each driver for its own private use. All the network ioctl commands are defined in sockios.h.
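
A driver that supports such a private command typically dispatches on it in its do_ioctl method; a minimal, hypothetical sketch:

```c
/* Sketch: recognizing the first driver-private ioctl command. */
static int my_do_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
{
    switch (cmd) {
    case SIOCDEVPRIVATE:
        /* interpret ifr->ifr_data in a driver-specific way */
        return 0;
    default:
        return -EOPNOTSUPP;    /* not one of our commands */
    }
}
```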

#include <linux/mii.h>

struct mii_if_info;

Declarations and a structure supporting drivers of devices that implement the MII standard.

#include <linux/ethtool.h>

struct ethtool_ops;

Declarations and structures that let devices work with the ethtool utility.




[1] NAPI stands for "new API"; the networking hackers are better at creating interfaces than naming them.

[2] Note that, according to <linux/sockios.h>, the SIOCDEVPRIVATE commands are deprecated. What should replace them is not clear, however, and numerous in-tree drivers still use them.

Chapter 18. TTY Drivers

A tty device gets its name from the very old abbreviation of teletypewriter and was originally associated only with the physical or virtual terminal connection to a Unix machine. Over time, the name also came to mean any serial port style device, as terminal connections could also be created over such a connection. Some examples of physical tty devices are serial ports, USB-to-serial-port converters, and some types of modems that need special processing to work properly (such as the traditional WinModem style devices). tty virtual devices support virtual consoles that are used to log into a computer, whether from the keyboard, over a network connection, or through an xterm session.

The Linux tty driver core lives right below the standard character driver level and provides a range of features focused on providing an interface for terminal style devices to use. The core is responsible for controlling both the flow of data across a tty device and the format of the data. This allows tty drivers to focus on handling the data to and from the hardware, instead of worrying about how to control the interaction with user space in a consistent way. To control the flow of data, there are a number of different line disciplines that can be virtually "plugged" into any tty device. This is done by different tty line discipline drivers.

As Figure 18-1 shows, the tty core takes data from a user that is to be sent to a tty device. It then passes it to a tty line discipline driver, which then passes it to the tty driver. The tty driver converts the data into a format that can be sent to the hardware. Data being received from the tty hardware flows back up through the tty driver, into the tty line discipline driver, and into the tty core, where it can be retrieved by a user. Sometimes the tty driver communicates directly to the tty core, and the tty core sends data directly to the tty driver, but usually the tty line discipline has a chance to modify the data that is sent between the two.

Figure 18-1. tty core overview

The tty driver never sees the tty line discipline. The driver cannot communicate directly with the line discipline, nor does it realize it is even present. The driver's job is to format data that is sent to it in a manner that the hardware can understand, and receive data from the hardware. The tty line discipline's job is to format the data received from a user, or the hardware, in a specific manner. This formatting usually takes the form of a protocol conversion, such as PPP or Bluetooth.

There are three different types of tty drivers: console, serial port, and pty. The console and pty drivers have already been written and probably are the only ones needed of these types of tty drivers. This leaves any new drivers using the tty core to interact with the user and the system as serial port drivers.

To determine what kind of tty drivers are currently loaded in the kernel and what tty devices are currently present, look at the /proc/tty/drivers file. This file consists of a list of the different tty drivers currently present, showing the name of the driver, the default node name, the major number for the driver, the range of minors used by the driver, and the type of the tty driver. The following is an example of this file:

/dev/tty             /dev/tty        5       0 system:/dev/tty
/dev/console         /dev/console    5       1 system:console
/dev/ptmx            /dev/ptmx       5       2 system
/dev/vc/0            /dev/vc/0       4       0 system:vtmaster
usbserial            /dev/ttyUSB   188   0-254 serial
serial               /dev/ttyS       4   64-67 serial
pty_slave            /dev/pts      136   0-255 pty:slave
pty_master           /dev/ptm      128   0-255 pty:master
pty_slave            /dev/ttyp       3   0-255 pty:slave
pty_master           /dev/pty        2   0-255 pty:master
unknown              /dev/tty        4    1-63 console

The /proc/tty/driver/ directory contains individual files for some of the tty drivers, if they implement that functionality. The default serial driver creates a file in this directory that shows a lot of serial-port-specific information about the hardware. Information on how to create a file in this directory is described later.

All of the tty devices currently registered and present in the kernel have their own subdirectory under /sys/class/tty. Within that subdirectory, there is a "dev" file that contains the major and minor number assigned to that tty device. If the driver tells the kernel the locations of the physical device and driver associated with the tty device, it creates symlinks back to them. An example of this tree is:

/sys/class/tty/
|-- console
|   `-- dev
|-- ptmx
|   `-- dev
|-- tty
|   `-- dev
|-- tty0
|   `-- dev
   ... 
|-- ttyS1
|   `-- dev
|-- ttyS2
|   `-- dev
|-- ttyS3
|   `-- dev
   ...
|-- ttyUSB0
|   |-- dev
|   |-- device -> ../../../devices/pci0000:00/0000:00:09.0/usb3/3-1/3-1:1.0/ttyUSB0
|   `-- driver -> ../../../bus/usb-serial/drivers/keyspan_4
|-- ttyUSB1
|   |-- dev
|   |-- device -> ../../../devices/pci0000:00/0000:00:09.0/usb3/3-1/3-1:1.0/ttyUSB1
|   `-- driver -> ../../../bus/usb-serial/drivers/keyspan_4
|-- ttyUSB2
|   |-- dev
|   |-- device -> ../../../devices/pci0000:00/0000:00:09.0/usb3/3-1/3-1:1.0/ttyUSB2
|   `-- driver -> ../../../bus/usb-serial/drivers/keyspan_4
`-- ttyUSB3
    |-- dev
    |-- device -> ../../../devices/pci0000:00/0000:00:09.0/usb3/3-1/3-1:1.0/ttyUSB3
    `-- driver -> ../../../bus/usb-serial/drivers/keyspan_4

A Small TTY Driver

To explain how the tty core works, we create a small tty driver that can be loaded, written to and read from, and unloaded. The main data structure of any tty driver is the struct tty_driver. It is used to register and unregister a tty driver with the tty core and is described in the kernel header file <linux/tty_driver.h>.

To create a struct tty_driver, the function alloc_tty_driver must be called with the number of tty devices this driver supports as the parameter. This can be done with the following brief code:

/* allocate the tty driver */
tiny_tty_driver = alloc_tty_driver(TINY_TTY_MINORS);
if (!tiny_tty_driver)
    return -ENOMEM;

After the alloc_tty_driver function is successfully called, the struct tty_driver should be initialized with the proper information based on the needs of the tty driver. This structure contains a lot of different fields, but not all of them have to be initialized in order to have a working tty driver. Here is an example that initializes the structure and sets up enough of the fields to create a working tty driver. It uses the tty_set_operations function to help copy over the set of function operations defined in the driver:

static struct tty_operations serial_ops = {
    .open = tiny_open,
    .close = tiny_close,
    .write = tiny_write,
    .write_room = tiny_write_room,
    .set_termios = tiny_set_termios,
};

...

    /* initialize the tty driver */
    tiny_tty_driver->owner = THIS_MODULE;
    tiny_tty_driver->driver_name = "tiny_tty";
    tiny_tty_driver->name = "ttty";
    tiny_tty_driver->devfs_name = "tts/ttty%d";
    tiny_tty_driver->major = TINY_TTY_MAJOR,
    tiny_tty_driver->type = TTY_DRIVER_TYPE_SERIAL,
    tiny_tty_driver->subtype = SERIAL_TYPE_NORMAL,
    tiny_tty_driver->flags = TTY_DRIVER_REAL_RAW | TTY_DRIVER_NO_DEVFS,
    tiny_tty_driver->init_termios = tty_std_termios;
    tiny_tty_driver->init_termios.c_cflag = B9600 | CS8 | CREAD | HUPCL | CLOCAL;
    tty_set_operations(tiny_tty_driver, &serial_ops);

The variables and functions listed above, and how this structure is used, are explained in the rest of the chapter.

To register this driver with the tty core, the struct tty_driver must be passed to the tty_register_driver function:

/* register the tty driver */
retval = tty_register_driver(tiny_tty_driver);
if (retval) {
    printk(KERN_ERR "failed to register tiny tty driver");
    put_tty_driver(tiny_tty_driver);
    return retval;
}

When tty_register_driver is called, the kernel creates all of the different sysfs tty files for the whole range of minor devices that this tty driver can have. If you use devfs (not covered in this book), devfs files are created too, unless the TTY_DRIVER_NO_DEVFS flag is specified. That flag may be specified if you want to call tty_register_device only for the devices that actually exist on the system, so the user always has an up-to-date view of the devices present in the kernel, which is what devfs users expect.

After registering itself, the driver registers the devices it controls through the tty_register_device function. This function has three arguments:

  • A pointer to the struct tty_driver that the device belongs to.

  • The minor number of the device.

  • A pointer to the struct device that this tty device is bound to. If the tty device is not bound to any struct device, this argument can be set to NULL.

Our driver registers all of the tty devices at once, as they are virtual and not bound to any physical devices:

for (i = 0; i < TINY_TTY_MINORS; ++i)
    tty_register_device(tiny_tty_driver, i, NULL);

To unregister the driver with the tty core, all tty devices that were registered by calling tty_register_device need to be cleaned up with a call to tty_unregister_device. Then the struct tty_driver must be unregistered with a call to tty_unregister_driver:

for (i = 0; i < TINY_TTY_MINORS; ++i)
    tty_unregister_device(tiny_tty_driver, i);
tty_unregister_driver(tiny_tty_driver);
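
The allocation, registration, and cleanup fragments shown in this section pair up naturally in the module's init and exit routines; a condensed sketch (field initialization and error unwinding abbreviated):

```c
/* Sketch: how the registration calls pair up over the module lifetime. */
static int __init tiny_init(void)
{
    int retval, i;

    tiny_tty_driver = alloc_tty_driver(TINY_TTY_MINORS);
    if (!tiny_tty_driver)
        return -ENOMEM;
    /* ... set the fields and call tty_set_operations, as shown above ... */

    retval = tty_register_driver(tiny_tty_driver);
    if (retval) {
        put_tty_driver(tiny_tty_driver);
        return retval;
    }
    for (i = 0; i < TINY_TTY_MINORS; ++i)
        tty_register_device(tiny_tty_driver, i, NULL);
    return 0;
}

static void __exit tiny_exit(void)
{
    int i;

    for (i = 0; i < TINY_TTY_MINORS; ++i)
        tty_unregister_device(tiny_tty_driver, i);
    tty_unregister_driver(tiny_tty_driver);
}
```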

struct termios

The init_termios variable in the struct tty_driver is a struct termios. This variable is used to provide a sane set of line settings if the port is used before it is initialized by a user. The driver initializes the variable with a standard set of values, which is copied from the tty_std_termios variable. tty_std_termios is defined in the tty core as:

struct termios tty_std_termios = {
    .c_iflag = ICRNL | IXON,
    .c_oflag = OPOST | ONLCR,
    .c_cflag = B38400 | CS8 | CREAD | HUPCL,
    .c_lflag = ISIG | ICANON | ECHO | ECHOE | ECHOK |
               ECHOCTL | ECHOKE | IEXTEN,
    .c_cc = INIT_C_CC
};

The struct termios structure is used to hold all of the current line settings for a specific port on the tty device. These line settings control the current baud rate, data size, data flow settings, and many other values. The different fields of this structure are:

tcflag_t c_iflag;

The input mode flags

tcflag_t c_oflag;

The output mode flags

tcflag_t c_cflag;

The control mode flags

tcflag_t c_lflag;

The local mode flags

cc_t c_line;

The line discipline type

cc_t c_cc[NCCS];

An array of control characters

All of the mode flags are defined as a large bitfield. The different values of the modes, and what they are used for, can be seen in the termios manpages available in any Linux distribution. The kernel provides a set of useful macros to get at the different bits. These macros are defined in the header file include/linux/tty.h.

All the fields that were defined in the tiny_tty_driver variable are necessary to have a working tty driver. The owner field is necessary in order to prevent the tty driver from being unloaded while the tty port is open. In previous kernel versions, it was up to the tty driver itself to handle the module reference counting logic. But kernel programmers determined that it would be difficult to solve all of the different possible race conditions, so the tty core now handles all of this control for the tty drivers.

The driver_name and name fields look very similar, yet are used for different purposes. The driver_name variable should be set to something short, descriptive, and unique among all tty drivers in the kernel. This is because it shows up in the /proc/tty/drivers file to describe the driver to the user and in the sysfs tty class directory of tty drivers currently loaded. The name field is used to define a name for the individual tty nodes assigned to this tty driver in the /dev tree. This string is used to create a tty device by appending the number of the tty device being used at the end of the string. It is also used to create the device name in the sysfs /sys/class/tty/ directory. If devfs is enabled in the kernel, this name should include any subdirectory that the tty driver wants to be placed into. As an example, the serial driver in the kernel sets the name field to tts/ if devfs is enabled and ttyS if it is not. This string is also displayed in the /proc/tty/drivers file.

As we mentioned, the /proc/tty/drivers file shows all of the currently registered tty drivers. With the tiny_tty driver registered in the kernel and no devfs, this file looks something like the following:

$ cat /proc/tty/drivers
tiny_tty             /dev/ttty     240     0-3 serial
usbserial            /dev/ttyUSB   188   0-254 serial
serial               /dev/ttyS       4  64-107 serial
pty_slave            /dev/pts      136   0-255 pty:slave
pty_master           /dev/ptm      128   0-255 pty:master
pty_slave            /dev/ttyp       3   0-255 pty:slave
pty_master           /dev/pty        2   0-255 pty:master
unknown              /dev/vc/        4    1-63 console
/dev/vc/0            /dev/vc/0       4       0 system:vtmaster
/dev/ptmx            /dev/ptmx       5       2 system
/dev/console         /dev/console    5       1 system:console
/dev/tty             /dev/tty        5       0 system:/dev/tty

Also, the sysfs directory /sys/class/tty looks something like the following when the tiny_tty driver is registered with the tty core:

$ tree /sys/class/tty/ttty*
/sys/class/tty/ttty0
`-- dev
/sys/class/tty/ttty1
`-- dev
/sys/class/tty/ttty2
`-- dev
/sys/class/tty/ttty3
`-- dev

$ cat /sys/class/tty/ttty0/dev 
240:0

The major variable describes what the major number for this driver is. The type and subtype variables declare what type of tty driver this driver is. For our example, we are a serial driver of a "normal" type. The only other subtype for a tty driver would be a "callout" type. Callout devices were traditionally used to control the line settings of a device. The data would be sent and received through one device node, and any line setting changes would be sent to a different device node, which was the callout device. This required the use of two minor numbers for every single tty device. Thankfully, almost all drivers handle both the data and line settings on the same device node, and the callout type is rarely used for new drivers.

The flags variable is used by both the tty driver and the tty core to indicate the current state of the driver and what kind of tty driver it is. Several bitmask macros are defined that you must use when testing or manipulating the flags. Three bits in the flags variable can be set by the driver:

TTY_DRIVER_RESET_TERMIOS

This flag states that the tty core resets the termios setting whenever the last process has closed the device. This is useful for the console and pty drivers. For instance, suppose the user leaves a terminal in a weird state. With this flag set, the terminal is reset to a normal value when the user logs out or the process that controlled the session is "killed."

TTY_DRIVER_REAL_RAW

This flag states that the tty driver guarantees to send notifications of parity or break characters up to the line discipline. This allows the line discipline to process received characters in a much quicker manner, as it does not have to inspect every character received from the tty driver. Because of the speed benefits, this value is usually set for all tty drivers.

TTY_DRIVER_NO_DEVFS

This flag states that when the call to tty_register_driver is made, the tty core does not create any devfs entries for the tty devices. This is useful for any driver that dynamically creates and destroys the minor devices. Examples of drivers that set this are the USB-to-serial drivers, the USB modem driver, the USB Bluetooth tty driver, and a number of the standard serial port drivers.

When the tty driver later wants to register a specific tty device with the tty core, it must call tty_register_device, with a pointer to the tty driver, and the minor number of the device that has been created. If this is not done, the tty core still passes all calls to the tty driver, but some of the internal tty-related functionality might not be present. This includes /sbin/hotplug notification of new tty devices and sysfs representation of the tty device. When the registered tty device is removed from the machine, the tty driver must call tty_unregister_device.

The one remaining bit in this variable is controlled by the tty core and is called TTY_DRIVER_INSTALLED. This flag is set by the tty core after the driver has been registered and should never be set by a tty driver.

tty_driver Function Pointers

Finally, the tiny_tty driver declares four function pointers.

open and close

The open function is called by the tty core when a user calls open on the device node the tty driver is assigned to. The tty core calls this with a pointer to the tty_struct structure assigned to this device, and a file pointer. The open field must be set by a tty driver for it to work properly; otherwise, -ENODEV is returned to the user when open is called.

When this open function is called, the tty driver is expected to either save some data within the tty_struct variable that is passed to it, or save the data within a static array that can be referenced based on the minor number of the port. This is necessary so the tty driver knows which device is being referenced when the later close, write, and other functions are called.

The tiny_tty driver saves a pointer within the tty structure, as can be seen with the following code:

static int tiny_open(struct tty_struct *tty, struct file *file)
{
    struct tiny_serial *tiny;
    struct timer_list *timer;
    int index;

    /* initialize the pointer in case something fails */
    tty->driver_data = NULL;

    /* get the serial object associated with this tty pointer */
    index = tty->index;
    tiny = tiny_table[index];
    if (tiny == NULL) {
        /* first time accessing this device, let's create it */
        tiny = kmalloc(sizeof(*tiny), GFP_KERNEL);
        if (!tiny)
            return -ENOMEM;

        init_MUTEX(&tiny->sem);
        tiny->open_count = 0;
        tiny->timer = NULL;

        tiny_table[index] = tiny;
    }

    down(&tiny->sem);

    /* save our structure within the tty structure */
    tty->driver_data = tiny;
    tiny->tty = tty;

在此代码中,tiny_serial 结构体被保存在 tty 结构体中。这允许 tiny_write、tiny_write_room 和 tiny_close 函数检索 tiny_serial 结构并正确操作它。

In this code, the tiny_serial structure is saved within the tty structure. This allows the tiny_write, tiny_write_room, and tiny_close functions to retrieve the tiny_serial structure and manipulate it properly.

tiny_serial结构定义为:

The tiny_serial structure is defined as:

struct tiny_serial {
    struct tty_struct   *tty;       /* pointer to the tty for this device */
    int         open_count; /* number of times this port has been opened */
    struct semaphore    sem;        /* locks this structure */
    struct timer_list   *timer;

};

正如我们所看到的,open_count 变量在端口第一次打开时于 open 调用中被初始化为 0。这是一个典型的引用计数器,之所以需要它,是因为 tty 驱动程序的 open 和 close 函数可以针对同一设备被多次调用,以便允许多个进程读取和写入数据。为了正确处理所有情况,必须记录端口被打开或关闭的次数;端口被使用时,驱动程序会增加和减少该计数。当端口第一次打开时,可以完成任何所需的硬件初始化和内存分配。当端口最后一次关闭时,可以完成任何所需的硬件关闭和内存清理。

As we've seen, the open_count variable is initialized to 0 in the open call the first time the port is opened. This is a typical reference counter, needed because the open and close functions of a tty driver can be called multiple times for the same device in order to allow multiple processes to read and write data. To handle everything correctly, a count of how many times the port has been opened or closed must be kept; the driver increments and decrements the count as the port is used. When the port is opened for the first time, any needed hardware initialization and memory allocation can be done. When the port is closed for the last time, any needed hardware shutdown and memory cleanup can be done.

tiny_open函数的其余部分展示了如何跟踪设备被打开的次数:

The rest of the tiny_open function shows how to keep track of the number of times the device has been opened:

    ++tiny->open_count;
    if (tiny->open_count == 1) {
        /* this is the first time this port is opened */
        /* do any hardware initialization needed here */

如果发生了阻止打开成功的情况,open 函数必须返回一个负的错误号;否则返回 0 表示成功。

The open function must return either a negative error number if something has happened to prevent the open from being successful, or a 0 to indicate success.

当用户在先前通过调用 open 创建的文件句柄上调用 close 时,tty 核心会调用 close 函数指针。这表明此时应该关闭设备。然而,由于 open 函数可以被多次调用,close 函数也可以被多次调用。因此该函数应该跟踪它被调用了多少次,以确定此时是否真的应该关闭硬件。tiny_tty 驱动程序使用以下代码来实现这一点:

The close function pointer is called by the tty core when close is called by a user on the file handle that was previously created with a call to open. This indicates that the device should be closed at this time. However, since the open function can be called more than once, the close function also can be called more than once. So this function should keep track of how many times it has been called to determine if the hardware should really be shut down at this time. The tiny_tty driver does this with the following code:

static void do_close(struct tiny_serial *tiny)
{
    down(&tiny->sem);

    if (!tiny->open_count) {
        /* port was never opened */
        goto exit;
    }

    --tiny->open_count;
    if (tiny->open_count <= 0) {
        /* The port is being closed by the last user. */
        /* Do any hardware specific stuff here */

        /* shut down our timer */
        del_timer(tiny->timer);
    }
exit:
    up(&tiny->sem);
}

static void tiny_close(struct tty_struct *tty, struct file *file)
{
    struct tiny_serial *tiny = tty->driver_data;

    if (tiny)
        do_close(tiny);
}

tiny_close 函数只是调用 do_close 函数来完成关闭设备的实际工作。这样,当驱动程序被卸载而某个端口仍处于打开状态时,就不必在此处重复关闭逻辑。close 函数没有返回值,因为它不应该失败。

The tiny_close function just calls the do_close function to do the real work of closing the device. This is done so that the shutdown logic does not have to be duplicated here and when the driver is unloaded and a port is open. The close function has no return value, as it is not supposed to be able to fail.

数据流

Flow of Data

当有数据要发送到硬件时,用户会发起 write 函数调用。tty 核心首先接收该调用,然后把数据传递给 tty 驱动程序的 write 函数。tty 核心还会告诉 tty 驱动程序所发送数据的大小。

The write function call is called by the user when there is data to be sent to the hardware. First the tty core receives the call, and then it passes the data on to the tty driver's write function. The tty core also tells the tty driver the size of the data being sent.

有时,由于 tty 硬件的速度和缓冲区容量的限制,在调用 write 函数时,写入程序请求的字符并不一定能够全部发送出去。write 函数应返回能够发送到硬件(或排队等待稍后发送)的字符数,以便用户程序可以检查是否所有数据都确实被写入了。在用户空间中完成此检查,比让内核驱动程序一直休眠直到所有请求的数据都能发送出去要容易得多。如果在 write 调用期间发生任何错误,应返回一个负的错误值,而不是已写入的字符数。

Sometimes, because of the speed and buffer capacity of the tty hardware, not all characters requested by the writing program can be sent at the moment the write function is called. The write function should return the number of characters that was able to be sent to the hardware (or queued to be sent at a later time), so that the user program can check if all of the data really was written. It is much easier for this check to be done in user space than it is for a kernel driver to sit and sleep until all of the requested data is able to be sent out. If any errors happen during the write call, a negative error value should be returned instead of the number of characters that were written.

write 函数既可以从中断上下文调用,也可以从用户上下文调用。了解这一点很重要,因为 tty 驱动程序在中断上下文中不应调用任何可能休眠的函数。这包括任何可能调用 schedule 的函数,例如常用的 copy_from_user、kmalloc 和 printk。如果确实需要休眠,请务必先通过调用 in_interrupt 检查驱动程序是否处于中断上下文中。

The write function can be called from both interrupt context and user context. This is important to know, as the tty driver should not call any functions that might sleep when it is in interrupt context. These include any function that might possibly call schedule, such as the common functions copy_from_user, kmalloc, and printk. If you really want to sleep, make sure to check first whether the driver is in interrupt context by calling in_interrupt.
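作为示意,下面是一段假设性的内核代码草图(tiny_alloc_buffer 这个名字是本示例虚构的),展示上述规则的一种常见用法:在可能处于中断上下文的路径中选择不会休眠的分配方式。

As a sketch, here is hypothetical kernel code (the name tiny_alloc_buffer is invented for this example) showing a common use of the rule above: pick a non-sleeping allocation on paths that may run in interrupt context.

```c
/* Hypothetical sketch: GFP_KERNEL may sleep, GFP_ATOMIC never does,
 * so check the context before choosing the allocation flags. */
static void *tiny_alloc_buffer(size_t size)
{
    if (in_interrupt())
        return kmalloc(size, GFP_ATOMIC);
    return kmalloc(size, GFP_KERNEL);
}
```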

这个示例 tiny tty 驱动程序并不连接任何真实的硬件,因此它的 write 函数只是把本应写出的数据记录到内核调试日志中。它使用以下代码做到这一点:

This sample tiny tty driver does not connect to any real hardware, so its write function simply records in the kernel debug log what data was supposed to be written. It does this with the following code:

static int tiny_write(struct tty_struct *tty, 
              const unsigned char *buffer, int count)
{
    struct tiny_serial *tiny = tty->driver_data;
    int i;
    int retval = -EINVAL;

    if (!tiny)
        return -ENODEV;

    down(&tiny->sem);

    if (!tiny->open_count)
        /* port was not opened */
        goto exit;

    /* fake sending the data out a hardware port by
     * writing it to the kernel debug log.
     */
    printk(KERN_DEBUG "%s - ", __FUNCTION__);
    for (i = 0; i < count; ++i)
        printk("%02x ", buffer[i]);
    printk("\n");
    retval = count;
exit:
    up(&tiny->sem);
    return retval;
}

当 tty 子系统本身需要向 tty 设备发送一些数据时,也可能调用 write 函数。如果 tty 驱动程序没有实现 put_char 函数,就会发生这种情况。在这种情况下,tty 核心会以数据大小 1 来调用 write 函数回调。这通常发生在 tty 核心想要把换行符转换为回车加换行符时。这里可能出现的最大问题是,对于这种调用,tty 驱动程序的 write 函数不能返回 0。这意味着驱动程序必须把这一个字节的数据写入设备,因为调用者(tty 核心)不会缓冲该数据并在稍后重试。由于 write 函数无法判断它是否是在替代 put_char 被调用,即使只发送一个字节的数据,也请在实现 write 函数时让它在返回之前始终至少写入一个字节。当前许多 USB 转串口 tty 驱动程序不遵守这条规则,因此某些类型的终端连接到它们时无法正常工作。

The write function can be called when the tty subsystem itself needs to send some data out the tty device. This can happen if the tty driver does not implement the put_char function in the tty_struct. In that case, the tty core uses the write function callback with a data size of 1. This commonly happens when the tty core wants to convert a newline character to a line feed plus a newline character. The biggest problem that can occur here is that the tty driver's write function must not return 0 for this kind of call. This means that the driver must write that byte of data to the device, as the caller (the tty core) does not buffer the data and try again at a later time. As the write function cannot determine if it is being called in the place of put_char, even if only one byte of data is being sent, try to implement the write function so it always writes at least one byte before returning. A number of the current USB-to-serial tty drivers do not follow this rule, and because of this, some terminal types do not work properly when connected to them.
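作为示意(假设性代码,非书中示例),驱动程序可以通过提供一个复用自身 write 路径的 put_char 来避免上述大小为 1 的 write 回退调用:

As a sketch (an assumption, not code from the book), a driver can avoid the write-of-one-byte fallback described above by supplying a put_char that reuses its own write path:

```c
/* Hypothetical sketch: satisfy put_char by delegating to the
 * driver's own write function, so the single byte is never dropped. */
static void tiny_put_char(struct tty_struct *tty, unsigned char ch)
{
    tiny_write(tty, &ch, 1);
}
```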

当 tty 核心想要知道 tty 驱动程序的写缓冲区中还有多少可用空间时,会调用 write_room 函数。随着字符从写缓冲区中发送出去,以及 write 函数被调用向缓冲区添加字符,这个数字会随时间变化。

The write_room function is called when the tty core wants to know how much room in the write buffer the tty driver has available. This number changes over time as characters empty out of the write buffers and as the write function is called, adding characters to the buffer.

static int tiny_write_room(struct tty_struct *tty) 
{
    struct tiny_serial *tiny = tty->driver_data;
    int room = -EINVAL;

    if (!tiny)
        return -ENODEV;

    down(&tiny->sem);
    
    if (!tiny->open_count) {
        /* port was not opened */
        goto exit;
    }

    /* calculate how much room is left in the device */
    room = 255;

exit:
    up(&tiny->sem);
    return room;
}

其他缓冲功能

Other Buffering Functions

tty_driver 结构中的 chars_in_buffer 函数对于一个能工作的 tty 驱动程序来说不是必需的,但建议实现它。当 tty 核心想知道 tty 驱动程序的写缓冲区中还剩多少字符等待发送时,就会调用此函数。如果驱动程序可以在把字符发送到硬件之前先存储它们,就应该实现此函数,以便 tty 核心能够确定驱动程序中的所有数据是否已经全部发送完毕。

The chars_in_buffer function in the tty_driver structure is not required in order to have a working tty driver, but it is recommended. This function is called when the tty core wants to know how many characters are still remaining in the tty driver's write buffer to be sent out. If the driver can store characters before it sends them out to the hardware, it should implement this function in order for the tty core to be able to determine if all of the data in the driver has drained out.
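作为示意(假设性代码,非书中示例):tiny 驱动程序没有内部发送缓冲区,所以它的 chars_in_buffer 可以简单地报告 0;真实的驱动程序应返回已缓冲但尚未写到硬件的字节数。

As a sketch (an assumption, not code from the book): since the tiny driver has no internal transmit buffer, its chars_in_buffer could simply report 0; a real driver would return the number of bytes buffered but not yet written to the hardware.

```c
/* Hypothetical sketch: with no internal transmit buffer, report that
 * nothing is waiting; a real driver would return its buffered count. */
static int tiny_chars_in_buffer(struct tty_struct *tty)
{
    return 0;
}
```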

tty_driver 结构中的三个函数回调可用于刷新驱动程序持有的任何剩余数据。它们不是必须实现的,但如果 tty 驱动程序会在把数据发送到硬件之前先缓冲数据,则建议实现它们。前两个函数回调是 flush_chars 和 wait_until_sent。当 tty 核心使用 put_char 函数回调向 tty 驱动程序发送了若干字符之后,就会调用这些函数。当 tty 核心希望 tty 驱动程序开始把这些字符发送到硬件(如果尚未开始)时,会调用 flush_chars 函数回调。该函数允许在所有数据发送到硬件之前就返回。wait_until_sent 函数回调的工作方式大致相同;但它必须等到所有字符都发送完毕,或者传入的超时值到期(以先发生者为准)之后,才能返回 tty 核心。tty 驱动程序可以在该函数内休眠以完成操作。如果传给 wait_until_sent 函数回调的超时值为 0,该函数应一直等到操作完成。

Three function callbacks in the tty_driver structure can be used to flush any remaining data that the driver is holding on to. These are not required to be implemented, but are recommended if the tty driver can buffer data before it sends it to the hardware. The first two function callbacks are called flush_chars and wait_until_sent. These functions are called when the tty core has sent a number of characters to the tty driver using the put_char function callback. The flush_chars function callback is called when the tty core wants the tty driver to start sending these characters out to the hardware, if it hasn't already started. This function is allowed to return before all of the data is sent out to the hardware. The wait_until_sent function callback works much the same way; but it must wait until all of the characters are sent before returning to the tty core or until the passed in timeout value has expired, whichever happens first. The tty driver is allowed to sleep within this function in order to complete it. If the timeout value passed to the wait_until_sent function callback is set to 0, the function should wait until it is finished with the operation.

剩下的一个数据刷新函数回调是 flush_buffer。当 tty 驱动程序要把写缓冲区中尚存的所有数据从内存中丢弃时,tty 核心会调用它。缓冲区中剩余的任何数据都会丢失,不会被发送到设备。

The remaining data flushing function callback is flush_buffer. It is called by the tty core when the tty driver is to flush all of the data still in its write buffers out of memory. Any data remaining in the buffer is lost and not sent to the device.
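作为示意(假设性代码,非书中示例),一个 flush_buffer 实现只需丢弃内部写缓冲区中的数据,而不把它发送到设备:

As a sketch (an assumption, not code from the book), a flush_buffer implementation simply discards the driver's buffered output without sending it to the device:

```c
/* Hypothetical sketch: discard buffered-but-unsent output. A real
 * driver would reset its internal write-buffer indices here; the
 * data is simply lost, never sent to the device. */
static void tiny_flush_buffer(struct tty_struct *tty)
{
    struct tiny_serial *tiny = tty->driver_data;

    down(&tiny->sem);
    /* nothing is buffered in this sample driver, so nothing to do */
    up(&tiny->sem);
}
```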

没有读功能?

No read Function?

仅使用这些函数,就可以注册 tiny_tty 驱动程序、打开设备节点、向设备写入数据、关闭设备节点,以及从内核中注销和卸载驱动程序。但 tty 核心和 tty_driver 结构并不提供 read 函数;换句话说,不存在把数据从驱动程序送到 tty 核心的函数回调。

With only these functions, the tiny_tty driver can be registered, a device node opened, data written to the device, the device node closed, and the driver unregistered and unloaded from the kernel. But the tty core and tty_driver structure do not provide a read function; in other words, no function callback exists to get data from the driver to the tty core.

tty 驱动程序没有传统的 read 函数,而是负责在从硬件收到数据时把这些数据发送给 tty 核心。tty 核心会缓冲这些数据,直到用户请求读取。由于 tty 核心提供了缓冲逻辑,因此并非每个 tty 驱动程序都要实现自己的缓冲逻辑。当用户希望驱动程序停止或开始发送数据时,tty 核心会通知 tty 驱动程序,但如果内部 tty 缓冲区已满,则不会发生此类通知。

Instead of a conventional read function, the tty driver is responsible for sending any data received from the hardware to the tty core when it is received. The tty core buffers the data until it is asked for by the user. Because of the buffering logic the tty core provides, it is not necessary for every tty driver to implement its own buffering logic. The tty core notifies the tty driver when a user wants the driver to stop and start sending data, but if the internal tty buffers are full, no such notification occurs.

tty 核心把从 tty 驱动程序接收到的数据缓冲在一个称为 struct tty_flip_buffer 的结构中。翻转缓冲区是一个包含两个主要数据数组的结构。从 tty 设备接收的数据存储在第一个数组中。当该数组满了以后,任何等待数据的用户都会收到数据可读的通知。当用户从这个数组读取数据时,任何新到来的数据都会存储到第二个数组中。当第二个数组也满了之后,数据再次刷新给用户,驱动程序又开始填充第一个数组。本质上,接收到的数据在两个缓冲区之间来回“翻转”,但愿不会把两个缓冲区都溢出。为了尽量防止数据丢失,tty 驱动程序可以监视接收数组的大小,如果它满了,就立刻刷新缓冲区,而不是等待下一次机会。

The tty core buffers the data received by the tty drivers in a structure called struct tty_flip_buffer. A flip buffer is a structure that contains two main data arrays. Data being received from the tty device is stored in the first array. When that array is full, any user waiting on the data is notified that data is available to be read. While the user is reading the data from this array, any new incoming data is being stored in the second array. When that array is finished, the data is again flushed to the user, and the driver starts to fill up the first array. Essentially, the data being received "flips" from one buffer to the other, hopefully not overflowing both of them. To try to prevent data from being lost, a tty driver can monitor how big the incoming array is, and, if it fills up, tell the tty driver to flush the buffer at this moment in time, instead of waiting for the next available chance.

struct tty_flip_buffer 结构的细节对 tty 驱动程序来说并不重要,只有一个例外:变量 count。该变量记录缓冲区中当前还剩多少字节可用于接收数据。如果该值等于 TTY_FLIPBUF_SIZE,就需要通过调用 tty_flip_buffer_push 把翻转缓冲区刷新给用户。如以下代码所示:

The details of the struct tty_flip_buffer structure do not really matter to the tty driver, with one exception, the variable count. This variable contains how many bytes are currently left in the buffer that are being used for receiving data. If this value is equal to the value TTY_FLIPBUF_SIZE, the flip buffer needs to be flushed out to the user with a call to tty_flip_buffer_push. This is shown in the following bit of code:

for (i = 0; i < data_size; ++i) {
    if (tty->flip.count >= TTY_FLIPBUF_SIZE)
        tty_flip_buffer_push(tty);
    tty_insert_flip_char(tty, data[i], TTY_NORMAL);
}
tty_flip_buffer_push(tty);

从 tty 驱动程序接收到、要发送给用户的字符,通过调用 tty_insert_flip_char 添加到翻转缓冲区中。该函数的第一个参数是保存数据的 struct tty_struct,第二个参数是要保存的字符,第三个参数是应为该字符设置的标志。如果这是接收到的普通字符,标志值应设置为 TTY_NORMAL。如果这是指示接收数据出错的特殊类型字符,则应根据错误类型将其设置为 TTY_BREAK、TTY_FRAME、TTY_PARITY 或 TTY_OVERRUN。

Characters that are received from the tty driver to be sent to the user are added to the flip buffer with a call to tty_insert_flip_char. The first parameter of this function is the struct tty_struct the data should be saved in, the second parameter is the character to be saved, and the third parameter is any flags that should be set for this character. The flags value should be set to TTY_NORMAL if this is a normal character being received. If this is a special type of character indicating an error receiving data, it should be set to TTY_BREAK, TTY_FRAME, TTY_PARITY, or TTY_OVERRUN, depending on the error.

为了把数据“推送”给用户,需要调用 tty_flip_buffer_push。如本例所示,当翻转缓冲区即将溢出时也应调用此函数。因此,每当把数据添加到翻转缓冲区,或者翻转缓冲区已满时,tty 驱动程序都必须调用 tty_flip_buffer_push。如果 tty 驱动程序能够以非常高的速率接收数据,则应设置 tty->low_latency 标志,这会使得对 tty_flip_buffer_push 的调用在被调用时立即执行。否则,tty_flip_buffer_push 调用会把数据推出缓冲区的工作安排到不久之后执行。

In order to "push" the data to the user, a call to tty_flip_buffer_push is made. This function should also be called if the flip buffer is about to overflow, as is shown in this example. So whenever data is added to the flip buffer, or when the flip buffer is full, the tty driver must call tty_flip_buffer_push. If the tty driver can accept data at very high rates, the tty->low_latency flag should be set, which causes the call to tty_flip_buffer_push to be immediately executed when called. Otherwise, the tty_flip_buffer_push call schedules itself to push the data out of the buffer at some later point in the near future.

TTY 线路设置

TTY Line Settings

当用户想要更改 tty 设备的线路设置或检索当前线路设置时,他会调用许多不同的 termios 用户空间库函数之一,或直接在 tty 设备节点上进行 ioctl 调用。tty 核心把这两种接口转换成若干不同的 tty 驱动程序函数回调和 ioctl 调用。

When a user wants to change the line settings of a tty device or retrieve the current line settings, he makes one of the many different termios user-space library function calls or directly makes an ioctl call on the tty device node. The tty core converts both of these interfaces into a number of different tty driver function callbacks and ioctl calls.

set_termios

set_termios

大多数 termios 用户空间函数由库转换为对驱动程序节点的 ioctl 调用。随后,tty 核心把大量不同的 tty ioctl 调用转换为对 tty 驱动程序的单个 set_termios 函数调用。set_termios 回调需要确定被要求更改哪些线路设置,然后在 tty 设备上进行这些更改。tty 驱动程序必须能够解码 termios 结构中所有不同的设置,并对任何需要的更改做出反应。这是一项复杂的任务,因为所有线路设置都以多种不同方式打包在 termios 结构中。

The majority of the termios user-space functions are translated by the library into an ioctl call to the driver node. A large number of the different tty ioctl calls are then translated by the tty core into a single set_termios function call to the tty driver. The set_termios callback needs to determine which line settings it is being asked to change, and then make those changes in the tty device. The tty driver must be able to decode all of the different settings in the termios structure and react to any needed changes. This is a complicated task, as all of the line settings are packed into the termios structure in a wide variety of ways.

set_termios函数应该做的第一件事是确定是否确实需要更改任何内容。这可以通过以下代码完成:

The first thing that a set_termios function should do is determine whether anything actually has to be changed. This can be done with the following code:

unsigned int cflag;

cflag = tty->termios->c_cflag;

/* check that they really want us to change something */
if (old_termios) {
    if ((cflag == old_termios->c_cflag) &&
        (RELEVANT_IFLAG(tty->termios->c_iflag) ==
         RELEVANT_IFLAG(old_termios->c_iflag))) {
        printk(KERN_DEBUG " - nothing to change...\n");
        return;
    }
}

RELEVANT_IFLAG 宏定义为:

The RELEVANT_IFLAG macro is defined as:

#define RELEVANT_IFLAG(iflag) ((iflag) & (IGNBRK|BRKINT|IGNPAR|PARMRK|INPCK))

RELEVANT_IFLAG 宏用于屏蔽出 cflags 变量中的重要位。然后将其与旧值比较,看它们是否不同。如果相同,就不需要更改任何内容,于是我们直接返回。请注意,在访问 old_termios 变量之前,先检查它是否指向一个有效的结构。这是必需的,因为有时这个变量会被设置为 NULL。试图访问 NULL 指针的字段会导致内核崩溃(panic)。

and is used to mask off the important bits of the cflags variable. This is then compared to the old value to see if they differ. If not, nothing needs to be changed, so we return. Note that the old_termios variable is checked to see whether it points to a valid structure before it is accessed. This is required, as sometimes this variable is set to NULL. Trying to access a field off of a NULL pointer causes the kernel to panic.

要查看请求的字节大小,CSIZE 位掩码可用于从cflag 变量中分离出正确的位。如果无法确定大小,通常默认为八个数据位。这可以按如下方式实现:

To look at the requested byte size, the CSIZE bitmask can be used to separate out the proper bits from the cflag variable. If the size can not be determined, it is customary to default to eight data bits. This can be implemented as follows:

/* get the byte size */
switch (cflag & CSIZE) {
    case CS5:
        printk(KERN_DEBUG " - data bits = 5\n");
        break;
    case CS6:
        printk(KERN_DEBUG " - data bits = 6\n");
        break;
    case CS7:
        printk(KERN_DEBUG " - data bits = 7\n");
        break;
    default:
    case CS8:
        printk(KERN_DEBUG " - data bits = 8\n");
        break;
}

为了确定请求的奇偶校验值,可以用 PARENB 位掩码检查 cflag 变量,判断是否设置了奇偶校验。如果是,则可以用 PARODD 位掩码确定奇偶校验是奇校验还是偶校验。其实现如下:

To determine the requested parity value, the PARENB bitmask can be checked against the cflag variable to tell if any parity is to be set at all. If so, the PARODD bitmask can be used to determine if the parity should be odd or even. An implementation of this is:

/* determine the parity */
if (cflag & PARENB)
    if (cflag & PARODD)
        printk(KERN_DEBUG " - parity = odd\n");
    else
        printk(KERN_DEBUG " - parity = even\n");
else
    printk(KERN_DEBUG " - parity = none\n");

所请求的停止位也可以使用 CSTOPB 位掩码从 cflag 变量中确定。其实现如下:

The stop bits that are requested can also be determined from the cflag variable using the CSTOPB bitmask. An implementation of this is:

/* figure out the stop bits requested */
if (cflag & CSTOPB)
    printk(KERN_DEBUG " - stop bits = 2\n");
else
    printk(KERN_DEBUG " - stop bits = 1\n");

流量控制有两种基本类型:硬件和软件。为了确定用户是否要求硬件流控制,可以用 CRTSCTS 位掩码检查 cflag 变量。一个例子是:

There are two basic types of flow control: hardware and software. To determine if the user is asking for hardware flow control, the CRTSCTS bitmask can be checked against the cflag variable. An example of this is:

/* figure out the hardware flow control settings */
if (cflag & CRTSCTS)
    printk(KERN_DEBUG " - RTS/CTS is enabled\n");
else
    printk(KERN_DEBUG " - RTS/CTS is disabled\n");

确定软件流控制的不同模式以及不同的停止和开始字符有点复杂:

Determining the different modes of software flow control and the different stop and start characters is a bit more involved:

/* determine software flow control */
/* if we are implementing XON/XOFF, set the start and 
 * stop character in the device */
if (I_IXOFF(tty) || I_IXON(tty)) {
    unsigned char stop_char  = STOP_CHAR(tty);
    unsigned char start_char = START_CHAR(tty);

    /* if we are implementing INBOUND XON/XOFF */
    if (I_IXOFF(tty))
        printk(KERN_DEBUG " - INBOUND XON/XOFF is enabled, "
            "XON = %2x, XOFF = %2x", start_char, stop_char);
    else
        printk(KERN_DEBUG" - INBOUND XON/XOFF is disabled");

    /* if we are implementing OUTBOUND XON/XOFF */
    if (I_IXON(tty))
        printk(KERN_DEBUG" - OUTBOUND XON/XOFF is enabled, "
            "XON = %2x, XOFF = %2x", start_char, stop_char);
    else
        printk(KERN_DEBUG" - OUTBOUND XON/XOFF is disabled");
}

最后,需要确定波特率。tty 核心提供了一个函数 tty_get_baud_rate 来帮助完成这件事。该函数返回一个整数,指示该 tty 设备所请求的波特率:

Finally, the baud rate needs to be determined. The tty core provides a function, tty_get_baud_rate , to help do this. The function returns an integer indicating the requested baud rate for the specific tty device:

/* get the baud rate wanted */
printk(KERN_DEBUG " - baud rate = %d", tty_get_baud_rate(tty));

现在 tty 驱动程序已经确定了所有不同的线路设置,它可以根据这些值正确设置硬件。

Now that the tty driver has determined all of the different line settings, it can set the hardware up properly based on these values.

tiocmget 和 tiocmset

tiocmget and tiocmset

在 2.4 及更早版本的内核中,曾经有许多 tty ioctl 调用用于获取和设置不同的控制线设置。它们由常量 TIOCMGET、TIOCMBIS、TIOCMBIC 和 TIOCMSET 表示。TIOCMGET 用于从内核获取线路设置值;从 2.6 内核开始,这个 ioctl 调用已变成名为 tiocmget 的 tty 驱动程序回调函数。其他三个 ioctl 已被简化,现在由名为 tiocmset 的单个 tty 驱动程序回调函数表示。

In the 2.4 and older kernels, there used to be a number of tty ioctl calls to get and set the different control line settings. These were denoted by the constants TIOCMGET, TIOCMBIS, TIOCMBIC, and TIOCMSET. TIOCMGET was used to get the line setting values of the kernel, and as of the 2.6 kernel, this ioctl call has been turned into a tty driver callback function called tiocmget. The other three ioctls have been simplified and are now represented with a single tty driver callback function called tiocmset .

当 tty 核心想知道某个特定 tty 设备控制线的当前物理值时,会调用 tty 驱动程序中的 tiocmget 函数。这通常用于检索串行端口 DTR 和 RTS 线的值。如果由于硬件不允许,tty 驱动程序无法直接读取串行端口的 MSR 或 MCR 寄存器,则应在本地保留它们的副本。许多 USB 转串口驱动程序必须实现这种“影子”变量。如果在本地保留了这些值的副本,该函数可以这样实现:

The tiocmget function in the tty driver is called by the tty core when the core wants to know the current physical values of the control lines of a specific tty device. This is usually done to retrieve the values of the DTR and RTS lines of a serial port. If the tty driver cannot directly read the MSR or MCR registers of the serial port, because the hardware does not allow this, a copy of them should be kept locally. A number of the USB-to-serial drivers must implement this kind of "shadow" variable. Here is how this function could be implemented if a local copy of these values are kept:

static int tiny_tiocmget(struct tty_struct *tty, struct file *file)
{
    struct tiny_serial *tiny = tty->driver_data;

    unsigned int result = 0;
    unsigned int msr = tiny->msr;
    unsigned int mcr = tiny->mcr;

    result = ((mcr & MCR_DTR)  ? TIOCM_DTR  : 0) |  /* DTR is set */
             ((mcr & MCR_RTS)  ? TIOCM_RTS  : 0) |  /* RTS is set */
             ((mcr & MCR_LOOP) ? TIOCM_LOOP : 0) |  /* LOOP is set */
             ((msr & MSR_CTS)  ? TIOCM_CTS  : 0) |  /* CTS is set */
             ((msr & MSR_CD)   ? TIOCM_CAR  : 0) |  /* Carrier detect is set*/
             ((msr & MSR_RI)   ? TIOCM_RI   : 0) |  /* Ring Indicator is set */
             ((msr & MSR_DSR)  ? TIOCM_DSR  : 0);   /* DSR is set */

    return result;
}

当 tty 核心想要设置特定 tty 设备的控制线的值时,tty 核心会调用 tty 驱动程序中的tiocmset函数。tty 核心通过将它们传递到两个变量中来告诉 tty 驱动程序要设置哪些值以及要清除哪些值:setclear。这些变量包含应更改的线路设置的位掩码。ioctl调用从不要求驱动程序同时设置和清除特定位,因此哪个操作先发生并不重要。以下是 tty 驱动程序如何实现此功能的示例:

The tiocmset function in the tty driver is called by the tty core when the core wants to set the values of the control lines of a specific tty device. The tty core tells the tty driver what values to set and what to clear, by passing them in two variables: set and clear. These variables contain a bitmask of the lines settings that should be changed. An ioctl call never asks the driver to both set and clear a particular bit at the same time, so it does not matter which operation occurs first. Here is an example of how this function could be implemented by a tty driver:

static int tiny_tiocmset(struct tty_struct *tty, struct file *file,
                         unsigned int set, unsigned int clear)
{
    struct tiny_serial *tiny = tty->driver_data;
    unsigned int mcr = tiny->mcr;

    if (set & TIOCM_RTS)
        mcr |= MCR_RTS;
    if (set & TIOCM_DTR)
        mcr |= MCR_DTR;

    if (clear & TIOCM_RTS)
        mcr &= ~MCR_RTS;
    if (clear & TIOCM_DTR)
        mcr &= ~MCR_DTR;

    /* set the new MCR value in the device */
    tiny->mcr = mcr;
    return 0;
}

ioctl 调用

ioctls

当在设备节点上调用 ioctl(2) 时,tty 核心会调用 struct tty_driver 中的 ioctl 函数回调。如果 tty 驱动程序不知道如何处理传给它的 ioctl 值,它应该返回 -ENOIOCTLCMD,以便让 tty 核心尝试实现该调用的通用版本。

The ioctl function callback in the struct tty_driver is called by the tty core when ioctl(2) is called on the device node. If the tty driver does not know how to handle the ioctl value passed to it, it should return -ENOIOCTLCMD to try to let the tty core implement a generic version of the call.

2.6 内核定义了大约 70 个可以发送到 tty 驱动程序的不同 tty ioctl。大多数 tty 驱动程序并不处理所有这些,而只处理其中较常见的一小部分。以下列出了较常用的 tty ioctl、它们的含义以及如何实现它们:

The 2.6 kernel defines about 70 different tty ioctls that can be sent to a tty driver. Most tty drivers do not handle all of these, but only a small subset of the more common ones. Here is a list of the more popular tty ioctls, what they mean, and how to implement them:

TIOCSERGETLSR
TIOCSERGETLSR

获取此 tty 设备的线路状态寄存器(LSR)的值。

Gets the value of this tty device's line status register (LSR).

TIOCGSERIAL
TIOCGSERIAL

获取串行线信息。调用者可以通过此调用一次性从 tty 设备获取大量串行线路信息。某些程序(例如 setserial 和 dip)调用此函数,以确保波特率设置正确,并获取有关 tty 驱动程序所控制设备类型的一般信息。调用者传入一个指向 serial_struct 类型大型结构的指针,tty 驱动程序应使用正确的值填充该结构。以下是如何实现这一点的示例:

Gets the serial line information. A caller can potentially get a lot of serial line information from the tty device all at once in this call. Some programs (such as setserial and dip) call this function to make sure that the baud rate was properly set and to get general information on what type of device the tty driver controls. The caller passes in a pointer to a large struct of type serial_struct, which the tty driver should fill up with the proper values. Here is an example of how this can be implemented:

static int tiny_ioctl(struct tty_struct *tty, struct file *file,
                      unsigned int cmd, unsigned long arg)
{
    struct tiny_serial *tiny = tty->driver_data;
    if (cmd == TIOCGSERIAL) {
        struct serial_struct tmp;
        if (!arg)
            return -EFAULT;
        memset(&tmp, 0, sizeof(tmp));
        tmp.type        = tiny->serial.type;
        tmp.line        = tiny->serial.line;
        tmp.port        = tiny->serial.port;
        tmp.irq         = tiny->serial.irq;
        tmp.flags       = ASYNC_SKIP_TEST | ASYNC_AUTO_IRQ;
        tmp.xmit_fifo_size  = tiny->serial.xmit_fifo_size;
        tmp.baud_base       = tiny->serial.baud_base;
        tmp.close_delay     = 5*HZ;
        tmp.closing_wait    = 30*HZ;
        tmp.custom_divisor  = tiny->serial.custom_divisor;
        tmp.hub6        = tiny->serial.hub6;
        tmp.io_type     = tiny->serial.io_type;
        if (copy_to_user((void _ _user *)arg, &tmp, sizeof(tmp)))
            return -EFAULT;
        return 0;
    }
    return -ENOIOCTLCMD;
}
TIOCSSERIAL
TIOCSSERIAL

设置串行线信息。这是 TIOCGSERIAL 的反向操作,允许用户一次性设置 tty 设备的串行线路状态。调用时传入一个指向 struct serial_struct 的指针,其中包含 tty 设备现在应采用的数据。如果 tty 驱动程序没有实现这个调用,大多数程序仍然可以正常工作。

Sets the serial line information. This is the opposite of TIOCGSERIAL and allows the user to set the serial line status of the tty device all at once. A pointer to a struct serial_struct is passed to this call, full of data that the tty device should now be set to. If the tty driver does not implement this call, most programs still work properly.

TIOCMIWAIT
TIOCMIWAIT

等待 MSR 更改。用户在少见的情况下请求此 ioctl:它希望在内核中休眠,直到 tty 设备的 MSR 寄存器发生变化。arg 参数包含用户正在等待的事件类型。这通常用于等待状态线发生变化,表明有更多数据准备好发送到设备。

Waits for MSR change. The user asks for this ioctl in the unusual circumstances that it wants to sleep within the kernel until something happens to the MSR register of the tty device. The arg parameter contains the type of event that the user is waiting for. This is commonly used to wait until a status line changes, signaling that more data is ready to be sent to the device.

实现此 ioctl 时要小心,不要使用 interruptible_sleep_on 调用,因为它是不安全的(其中涉及许多令人讨厌的竞争条件)。相反,应使用 wait_queue 来避免这些问题。下面是如何实现此 ioctl 的示例:

Be careful when implementing this ioctl, and do not use the interruptible_sleep_on call, as it is unsafe (there are lots of nasty race conditions involved with it). Instead, a wait_queue should be used to avoid these problems. Here's an example of how to implement this ioctl:

static int tiny_ioctl(struct tty_struct *tty, struct file *file,
                      unsigned int cmd, unsigned long arg)
{
    struct tiny_serial *tiny = tty->driver_data;
    if (cmd == TIOCMIWAIT) {
        DECLARE_WAITQUEUE(wait, current);
        struct async_icount cnow;
        struct async_icount cprev;
        cprev = tiny->icount;
        while (1) {
            add_wait_queue(&tiny->wait, &wait);
            set_current_state(TASK_INTERRUPTIBLE);
            schedule();
            remove_wait_queue(&tiny->wait, &wait);
            /* see if a signal woke us up */
            if (signal_pending(current))
                return -ERESTARTSYS;
            cnow = tiny->icount;
            if (cnow.rng == cprev.rng && cnow.dsr == cprev.dsr &&
                cnow.dcd == cprev.dcd && cnow.cts == cprev.cts)
                return -EIO; /* no change => error */
            if (((arg & TIOCM_RNG) && (cnow.rng != cprev.rng)) ||
                ((arg & TIOCM_DSR) && (cnow.dsr != cprev.dsr)) ||
                ((arg & TIOCM_CD)  && (cnow.dcd != cprev.dcd)) ||
                ((arg & TIOCM_CTS) && (cnow.cts != cprev.cts)) ) {
                return 0;
            }
            cprev = cnow;
        }
    }
    return -ENOIOCTLCMD;
}

在 tty 驱动程序代码中识别 MSR 寄存器更改的某个位置,必须调用以下行才能使该代码正常工作:

Somewhere in the tty driver's code that recognizes that the MSR register changes, the following line must be called for this code to work properly:

wake_up_interruptible(&tp->wait);
TIOCGICOUNT
TIOCGICOUNT

获取 中断计数。当用户想知道发生了多少串行线中断时调用此函数。如果驱动程序有一个中断处理程序,它应该定义一个计数器的内部结构来跟踪这些统计数据,并在每次内核运行该函数时增加适当的计数器。

Gets interrupt counts. This is called when the user wants to know how many serial line interrupts have happened. If the driver has an interrupt handler, it should define an internal structure of counters to keep track of these statistics and increment the proper counter every time the function is run by the kernel.

此 ioctl 调用向内核传递一个指向 serial_icounter_struct 结构的指针,应由 tty 驱动程序填充。此调用通常与前面的 TIOCMIWAIT ioctl 调用结合使用。如果 tty 驱动程序在运行期间跟踪所有这些中断,则实现此调用的代码可能非常简单:

This ioctl call passes the kernel a pointer to a structure serial_icounter_struct , which should be filled by the tty driver. This call is often made in conjunction with the previous TIOCMIWAIT ioctl call. If the tty driver keeps track of all of these interrupts while the driver is operating, the code to implement this call can be very simple:

static int tiny_ioctl(struct tty_struct *tty, struct file *file,
                      unsigned int cmd, unsigned long arg)
{
    struct tiny_serial *tiny = tty->driver_data;
    if (cmd == TIOCGICOUNT) {
        struct async_icount cnow = tiny->icount;
        struct serial_icounter_struct icount;
        icount.cts  = cnow.cts;
        icount.dsr  = cnow.dsr;
        icount.rng  = cnow.rng;
        icount.dcd  = cnow.dcd;
        icount.rx   = cnow.rx;
        icount.tx   = cnow.tx;
        icount.frame    = cnow.frame;
        icount.overrun  = cnow.overrun;
        icount.parity   = cnow.parity;
        icount.brk  = cnow.brk;
        icount.buf_overrun = cnow.buf_overrun;
        if (copy_to_user((void _ _user *)arg, &icount, sizeof(icount)))
            return -EFAULT;
        return 0;
    }
    return -ENOIOCTLCMD;
}

TTY 设备的 proc 和 sysfs 处理

proc and sysfs Handling of TTY Devices

tty 核心为任何 tty 驱动程序提供了一种非常简单的方法,以便在 /proc/tty/driver 目录中维护一个文件。如果驱动程序定义了 read_proc 或 write_proc 函数,则会创建此文件。随后,对此文件的任何读取或写入调用都会发送到驱动程序。这些函数的格式与标准 /proc 文件处理函数类似。

The tty core provides a very easy way for any tty driver to maintain a file in the /proc/tty/driver directory. If the driver defines the read_proc or write_proc functions, this file is created. Then, any read or write call on this file is sent to the driver. The formats of these functions are just like the standard /proc file-handling functions.

作为示例,下面是read_proc tty 回调的简单实现,它仅打印出当前注册的端口号:

As an example, here is a simple implementation of the read_proc tty callback that merely prints out the number of the currently registered ports:

static int tiny_read_proc(char *page, char **start, off_t off, int count,
                          int *eof, void *data)
{
    struct tiny_serial *tiny;
    off_t begin = 0;
    int length = 0;
    int i;

    length += sprintf(page, "tinyserinfo:1.0 driver:%s\n", DRIVER_VERSION);
    for (i = 0; i < TINY_TTY_MINORS && length < PAGE_SIZE; ++i) {
        tiny = tiny_table[i];
        if (tiny == NULL)
            continue;

        length += sprintf(page+length, "%d\n", i);
        if ((length + begin) > (off + count))
            goto done;
        if ((length + begin) < off) {
            begin += length;
            length = 0;
        }
    }
    *eof = 1;
done:
    if (off >= (length + begin))
        return 0;
    *start = page + (off-begin);
    return (count < begin+length-off) ? count : begin + length-off;
}

注册 tty 驱动程序或创建各个 tty 设备时(取决于 struct tty_driver 中的 TTY_DRIVER_NO_DEVFS 标志),tty 核心会处理所有 sysfs 目录和设备的创建。各设备目录始终包含 dev 文件,允许用户空间工具确定分配给设备的主设备号和次设备号。如果在调用 tty_register_device 时传入了指向有效 struct device 的指针,该目录还会包含 device 和 driver 符号链接。除这三个文件之外,各个 tty 驱动程序无法在此位置创建新的 sysfs 文件。这可能会在未来的内核版本中发生变化。

The tty core handles all of the sysfs directory and device creation when the tty driver is registered, or when the individual tty devices are created, depending on the TTY_DRIVER_NO_DEVFS flag in the struct tty_driver. The individual directory always contains the dev file, which allows user-space tools to determine the major and minor number assigned to the device. It also contains a device and driver symlink, if a pointer to a valid struct device is passed in the call to tty_register_device. Other than these three files, it is not possible for individual tty drivers to create new sysfs files in this location. This will probably change in future kernel releases.

tty_driver 结构详细信息

The tty_driver Structure in Detail

tty_driver 结构用于向 tty 核心注册 tty 驱动程序。以下是该结构中所有不同字段以及 tty 核心如何使用它们的列表:

The tty_driver structure is used to register a tty driver with the tty core. Here is a list of all of the different fields in the structure and how they are used by the tty core:

struct module *owner;
struct module *owner;

该驱动程序的模块所有者。

The module owner for this driver.

int magic;
int magic;

该结构的“魔数”值,应始终设置为 TTY_DRIVER_MAGIC,并在 alloc_tty_driver 函数中初始化。

The "magic" value for this structure. Should always be set to TTY_DRIVER_MAGIC. Is initialized in the alloc_tty_driver function.

const char *driver_name;
const char *driver_name;

驱动程序的名称,在/proc/tty和 sysfs 中使用。

Name of the driver, used in /proc/tty and sysfs.

const char *name;
const char *name;

驱动程序的节点名称。

Node name of the driver.

int name_base;
int name_base;

为设备创建名称时使用的起始编号。当内核创建分配给 tty 驱动程序的特定 tty 设备的字符串表示形式时,将使用它。

Starting number to use when creating names for devices. This is used when the kernel creates a string representation of a specific tty device assigned to the tty driver.

short major;
short major;

驱动程序的主设备号。

Major number for the driver.

short minor_start;
short minor_start;

驱动程序的起始次设备号。这通常设置为与 name_base 相同的值。通常,该值设置为 0。

Starting minor number for the driver. This is usually set to the same value as name_base. Typically, this value is set to 0.

short num;
short num;

分配给驱动程序的次设备号的数量。如果驱动程序使用整个主设备号范围,则该值应设置为 255。该变量在 alloc_tty_driver 函数中初始化。

Number of minor numbers assigned to the driver. If an entire major number range is used by the driver, this value should be set to 255. This variable is initialized in the alloc_tty_driver function.

short type;

short subtype;
short type;

short subtype;

描述向 tty 核心注册的 tty 驱动程序的类型。subtype 的值取决于 type。type 字段可以是:

Describe what kind of tty driver is being registered with the tty core. The value of subtype depends on the type. The type field can be:

TTY_DRIVER_TYPE_SYSTEM
TTY_DRIVER_TYPE_SYSTEM

由 tty 子系统在内部使用,以记住它正在处理内部 tty 驱动程序。subtype应设置为 SYSTEM_TYPE_TTYSYSTEM_TYPE_CONSOLESYSTEM_TYPE_SYSCONS、 或SYSTEM_TYPE_SYSPTMX。任何“普通”tty 驱动程序都不应该使用此类型。

Used internally by the tty subsystem to remember that it is dealing with an internal tty driver. subtype should be set to SYSTEM_TYPE_TTY, SYSTEM_TYPE_CONSOLE, SYSTEM_TYPE_SYSCONS, or SYSTEM_TYPE_SYSPTMX. This type should not be used by any "normal" tty driver.

TTY_DRIVER_TYPE_CONSOLE
TTY_DRIVER_TYPE_CONSOLE

仅由控制台驱动程序使用。

Used only by the console driver.

TTY_DRIVER_TYPE_SERIAL
TTY_DRIVER_TYPE_SERIAL

由任何串行类型驱动程序使用。subtype 应设置为 SERIAL_TYPE_NORMAL 或 SERIAL_TYPE_CALLOUT,具体取决于您的驱动程序的类型。这是 type 字段最常见的设置之一。

Used by any serial type driver. subtype should be set to SERIAL_TYPE_NORMAL or SERIAL_TYPE_CALLOUT, depending on which type your driver is. This is one of the most common settings for the type field.

TTY_DRIVER_TYPE_PTY
TTY_DRIVER_TYPE_PTY

由伪终端接口 (pty) 使用。subtype需要设置为PTY_TYPE_MASTERPTY_TYPE_SLAVE

Used by the pseudo terminal interface (pty). subtype needs to be set to either PTY_TYPE_MASTER or PTY_TYPE_SLAVE.

struct termios init_termios;
struct termios init_termios;

创建设备时的初始 struct termios 值。

Initial struct termios values for the device when it is created.

int flags;
int flags;

驱动程序标志,如本章前面所述。

Driver flags, as described earlier in this chapter.

struct proc_dir_entry *proc_entry;
struct proc_dir_entry *proc_entry;

该驱动程序的/proc条目结构。如果驱动程序实现write_procread_proc函数,则它由 tty 核心创建。该字段不应由 tty 驱动程序本身设置。

This driver's /proc entry structure. It is created by the tty core if the driver implements the write_proc or read_proc functions. This field should not be set by the tty driver itself.

struct tty_driver *other;
struct tty_driver *other;

指向 tty 从驱动程序的指针。它仅由 pty 驱动程序使用,不应由任何其他 tty 驱动程序使用。

Pointer to a tty slave driver. This is used only by the pty driver and should not be used by any other tty driver.

void *driver_state;
void *driver_state;

tty 驱动程序的内部状态。只能由 pty 驱动程序使用。

Internal state of the tty driver. Should be used only by the pty driver.

struct tty_driver *next;

struct tty_driver *prev;
struct tty_driver *next;

struct tty_driver *prev;

链接变量。tty 核心使用这些变量将所有不同的 tty 驱动程序链接在一起,并且不应被任何 tty 驱动程序触及。

Linking variables. These variables are used by the tty core to chain all of the different tty drivers together, and should not be touched by any tty driver.

tty_operations 结构详细信息

The tty_operations Structure in Detail

tty_operations 结构包含可由 tty 驱动程序设置并由 tty 核心调用的所有函数回调。目前,该结构中包含的所有函数指针也存在于 tty_driver 结构中,但不久之后将只保留该结构的一个实例。

The tty_operations structure contains all of the function callbacks that can be set by a tty driver and called by the tty core. Currently, all of the function pointers contained in this structure are also in the tty_driver structure, but that will be replaced soon with only an instance of this structure.

int (*open)(struct tty_struct * tty, struct file * filp);
int (*open)(struct tty_struct * tty, struct file * filp);

open 函数。

The open function.

void (*close)(struct tty_struct * tty, struct file * filp);
void (*close)(struct tty_struct * tty, struct file * filp);

close 函数。

The close function.

int (*write)(struct tty_struct * tty, const unsigned char *buf, int count);
int (*write)(struct tty_struct * tty, const unsigned char *buf, int count);

write 函数。

The write function.

void (*put_char)(struct tty_struct *tty, unsigned char ch);
void (*put_char)(struct tty_struct *tty, unsigned char ch);

单字符写入功能。当要将单个字符写入设备时,tty 核心将调用此函数。如果 tty 驱动程序未定义此函数,则当 tty 核心想要发送单个字符时,将调用write函数。

The single-character write function. This function is called by the tty core when a single character is to be written to the device. If a tty driver does not define this function, the write function is called instead when the tty core wants to send a single character.

void (*flush_chars)(struct tty_struct *tty);

void (*wait_until_sent)(struct tty_struct *tty, int timeout);
void (*flush_chars)(struct tty_struct *tty);

void (*wait_until_sent)(struct tty_struct *tty, int timeout);

将数据刷新到硬件的函数。

The function that flushes data to the hardware.

int (*write_room)(struct tty_struct *tty);
int (*write_room)(struct tty_struct *tty);

该函数指示有多少缓冲区是空闲的。

The function that indicates how much of the buffer is free.

int (*chars_in_buffer)(struct tty_struct *tty);
int (*chars_in_buffer)(struct tty_struct *tty);

该函数指示缓冲区有多少空间已充满数据。

The function that indicates how much of the buffer is full of data.

int (*ioctl)(struct tty_struct *tty, struct file * file, unsigned int cmd, unsigned long arg);
int (*ioctl)(struct tty_struct *tty, struct file * file, unsigned int cmd, unsigned long arg);

ioctl函数。当在设备节点上调用ioctl(2)时,tty 核心会调用此函数。

The ioctl function. This function is called by the tty core when ioctl(2) is called on the device node.

void (*set_termios)(struct tty_struct *tty, struct termios * old);
void (*set_termios)(struct tty_struct *tty, struct termios * old);

set_termios函数。当设备的 termios 设置更改时,tty 核心将调用此函数。

The set_termios function. This function is called by the tty core when the device's termios settings have been changed.

void (*throttle)(struct tty_struct * tty);

void (*unthrottle)(struct tty_struct * tty);

void (*stop)(struct tty_struct *tty);

void (*start)(struct tty_struct *tty);
void (*throttle)(struct tty_struct * tty);

void (*unthrottle)(struct tty_struct * tty);

void (*stop)(struct tty_struct *tty);

void (*start)(struct tty_struct *tty);

数据节流函数。这些函数用于帮助控制 tty 核心输入缓冲区的溢出。当 tty 核心的输入缓冲区快满时,将调用 throttle 函数,tty 驱动程序应尝试向设备发出信号,表示不应再向其发送更多字符。当 tty 核心的输入缓冲区已被清空、可以接受更多数据时,将调用 unthrottle 函数,tty 驱动程序随后应向设备发出可以接收数据的信号。stop 和 start 函数与 throttle 和 unthrottle 函数很相似,但它们表示 tty 驱动程序应停止向设备发送数据,并在稍后恢复发送。

Data-throttling functions. These functions are used to help control overruns of the tty core's input buffers. The throttle function is called when the tty core's input buffers are getting full. The tty driver should try to signal to the device that no more characters should be sent to it. The unthrottle function is called when the tty core's input buffers have been emptied out, and it can now accept more data. The tty driver should then signal to the device that data can be received. The stop and start functions are much like the throttle and unthrottle functions, but they signify that the tty driver should stop sending data to the device and then later resume sending data.

void (*hangup)(struct tty_struct *tty);
void (*hangup)(struct tty_struct *tty);

挂机功能。当 tty 驱动程序应挂起 tty 设备时调用此函数。任何需要执行此操作的特殊硬件操作都应在此时进行。

The hangup function. This function is called when the tty driver should hang up the tty device. Any special hardware manipulation needed to do this should occur at this time.

void (*break_ctl)(struct tty_struct *tty, int state);
void (*break_ctl)(struct tty_struct *tty, int state);

线路 BREAK 控制函数。当 tty 驱动程序要打开或关闭 RS-232 端口上的线路 BREAK 状态时,调用此函数。如果 state 设置为 -1,则应打开 BREAK 状态;如果 state 设置为 0,则应关闭 BREAK 状态。如果 tty 驱动程序实现了此函数,tty 核心将处理 TCSBRK、TCSBRKP、TIOCSBRK 和 TIOCCBRK 这些 ioctl;否则,这些 ioctl 将被发送到驱动程序的 ioctl 函数。

The line break control function. This function is called when the tty driver is to turn on or off the line BREAK status on the RS-232 port. If state is set to -1, the BREAK status should be turned on. If state is set to 0, the BREAK status should be turned off. If this function is implemented by the tty driver, the tty core will handle the TCSBRK, TCSBRKP, TIOCSBRK, and TIOCCBRK ioctls. Otherwise, these ioctls are sent to the driver's ioctl function.

void (*flush_buffer)(struct tty_struct *tty);
void (*flush_buffer)(struct tty_struct *tty);

刷新缓冲区并丢失所有剩余数据。

Flush buffer and lose any remaining data.

void (*set_ldisc)(struct tty_struct *tty);
void (*set_ldisc)(struct tty_struct *tty);

设置线路规则的函数。当 tty 核心更改了 tty 驱动程序的线路规则时,将调用此函数。该函数通常不被使用,也不应该由驱动程序定义。

The set line discipline function. This function is called when the tty core has changed the line discipline of the tty driver. This function is generally not used and should not be defined by a driver.

void (*send_xchar)(struct tty_struct *tty, char ch);
void (*send_xchar)(struct tty_struct *tty, char ch);

send_xchar 函数。该函数用于向 tty 设备发送高优先级的 XON 或 XOFF 字符。要发送的字符在 ch 变量中指定。

Send X-type char function. This function is used to send a high-priority XON or XOFF character to the tty device. The character to be sent is specified in the ch variable.

int (*read_proc)(char *page, char **start, off_t off, int count, int *eof, void *data);

int (*write_proc)(struct file *file, const char *buffer, unsigned long count, void *data);
int (*read_proc)(char *page, char **start, off_t off, int count, int *eof, void *data);

int (*write_proc)(struct file *file, const char *buffer, unsigned long count, void *data);

/proc 读写函数。

/proc read and write functions.

int (*tiocmget)(struct tty_struct *tty, struct file *file);
int (*tiocmget)(struct tty_struct *tty, struct file *file);

获取特定 tty 设备的当前线路设置。如果从 tty 设备成功检索,该值应返回给调用者。

Gets the current line settings of the specific tty device. If retrieved successfully from the tty device, the value should be returned to the caller.

int (*tiocmset)(struct tty_struct *tty, struct file *file, unsigned int set, unsigned int clear);
int (*tiocmset)(struct tty_struct *tty, struct file *file, unsigned int set, unsigned int clear);

设置特定 tty 设备的当前线路设置。setclear包含应设置或清除的不同线路设置。

Sets the current line settings of the specific tty device. set and clear contain the different line settings that should either be set or cleared.

tty_struct 结构详细信息

The tty_struct Structure in Detail

tty 核心使用 tty_struct 变量来保存特定 tty 端口的当前状态。除了少数例外,几乎所有字段都只能由 tty 核心使用。tty 驱动程序可以使用的字段如下所述:

The tty_struct variable is used by the tty core to keep the current state of a specific tty port. Almost all of its fields are to be used only by the tty core, with a few exceptions. The fields that a tty driver can use are described here:

unsigned long flags;
unsigned long flags;

tty 设备的当前状态。这是一个位域变量,可通过以下宏访问:

The current state of the tty device. This is a bitfield variable and is accessed through the following macros:

TTY_THROTTLED
TTY_THROTTLED

当驱动程序的 throttle 函数被调用时设置。不应由 tty 驱动程序设置,只能由 tty 核心设置。

Set when the driver has had the throttle function called. Should not be set by a tty driver, only the tty core.

TTY_IO_ERROR
TTY_IO_ERROR

当驱动程序不希望从驱动程序读取或写入任何数据时,由驱动程序设置。如果用户程序尝试执行此操作,它将收到来自内核的 -EIO 错误。这通常是在设备关闭时设置的。

Set by the driver when it does not want any data to be read from or written to the driver. If a user program attempts to do this, it receives an -EIO error from the kernel. This is usually set as the device is shutting down.

TTY_OTHER_CLOSED
TTY_OTHER_CLOSED

仅由 pty 驱动程序用于在端口关闭时发出通知。

Used only by the pty driver to notify when the port has been closed.

TTY_EXCLUSIVE
TTY_EXCLUSIVE

由 tty 核心设置,指示端口处于独占模式,并且一次只能由一个用户访问。

Set by the tty core to indicate that a port is in exclusive mode and can only be accessed by one user at a time.

TTY_DEBUG
TTY_DEBUG

未在内核中的任何地方使用。

Not used anywhere in the kernel.

TTY_DO_WRITE_WAKEUP
TTY_DO_WRITE_WAKEUP

如果设置了该值,则允许调用线路规程的write_wakeup函数。这通常是在 tty 驱动程序调用wake_up_interruptible函数的同时调用的 。

If this is set, the line discipline's write_wakeup function is allowed to be called. This is usually called at the same time the wake_up_interruptible function is called by the tty driver.

TTY_PUSH
TTY_PUSH

仅由默认 tty 线路规则在内部使用。

Used only internally by the default tty line discipline.

TTY_CLOSING
TTY_CLOSING

由 tty 核心用来跟踪端口当时是否正在关闭。

Used by the tty core to keep track if a port is in the process of closing at that moment in time or not.

TTY_DONT_FLIP
TTY_DONT_FLIP

由默认的 tty 线路规则使用,以通知 tty 核心在设置翻转缓冲区时不应更改它。

Used by the default tty line discipline to notify the tty core that it should not change the flip buffer when it is set.

TTY_HW_COOK_OUT
TTY_HW_COOK_OUT

如果由 tty 驱动程序设置,它会通知线路规则它将“烹饪”发送给它的输出。如果未设置,线路规则会分块复制驱动程序的输出;否则,它必须评估单独发送的每个字节以进行行更改。该标志通常不应由 tty 驱动程序设置。

If set by a tty driver, it notifies the line discipline that it will "cook" the output sent to it. If it is not set, the line discipline copies output of the driver in chunks; otherwise, it has to evaluate every byte sent individually for line changes. This flag should generally not be set by a tty driver.

TTY_HW_COOK_IN
TTY_HW_COOK_IN

几乎与在驱动程序的 flags 变量中设置 TTY_DRIVER_REAL_RAW 标志相同。该标志通常不应由 tty 驱动程序设置。

Almost identical to setting the TTY_DRIVER_REAL_RAW flag in the driver flags variable. This flag should generally not be set by a tty driver.

TTY_PTY_LOCK
TTY_PTY_LOCK

由 pty 驱动程序用来锁定和解锁端口。

Used by the pty driver to lock and unlock a port.

TTY_NO_WRITE_SPLIT
TTY_NO_WRITE_SPLIT

如果设置,tty 核心不会将 tty 驱动程序的写入拆分为正常大小的块。该值不应用于通过向端口发送大量数据来防止对 tty 端口的拒绝服务攻击。

If set, the tty core does not split up writes to the tty driver into normal-sized chunks. This value should not be used to prevent denial-of-service attacks on tty ports by sending large amounts of data to a port.

struct tty_flip_buffer flip;
struct tty_flip_buffer flip;

tty 设备的翻转缓冲区。

The flip buffer for the tty device.

struct tty_ldisc ldisc;
struct tty_ldisc ldisc;

tty 设备的线路规则。

The line discipline for the tty device.

wait_queue_head_t write_wait;
wait_queue_head_t write_wait;

tty写入函数的wait_queue。当 tty 驱动程序可以接收更多数据时,应该将其唤醒以发出信号。

The wait_queue for the tty writing function. A tty driver should wake this up to signal when it can receive more data.

struct termios *termios;
struct termios *termios;

指向 tty 设备当前 termios 设置的指针。

Pointer to the current termios settings for the tty device.

unsigned char stopped:1;
unsigned char stopped:1;

指示 tty 设备是否已停止。tty 驱动程序可以设置该值。

Indicates whether the tty device is stopped. The tty driver can set this value.

unsigned char hw_stopped:1;
unsigned char hw_stopped:1;

指示 tty 设备的硬件是否已停止。tty 驱动程序可以设置该值。

Indicates whether or not the tty device's hardware is stopped. The tty driver can set this value.

unsigned char low_latency:1;
unsigned char low_latency:1;

指示 tty 设备是否是低延迟设备,能够以非常高的速度接收数据。tty 驱动程序可以设置该值。

Indicates whether the tty device is a low-latency device, capable of receiving data at a very high rate of speed. The tty driver can set this value.

unsigned char closing:1;
unsigned char closing:1;

指示 tty 设备是否正在关闭端口。tty 驱动程序可以设置该值。

Indicates whether the tty device is in the middle of closing the port. The tty driver can set this value.

struct tty_driver driver;
struct tty_driver driver;

控制此 tty 设备的当前 tty_driver 结构。

The current tty_driver structure that controls this tty device.

void *driver_data;
void *driver_data;

tty_driver可用于存储 tty 驱动程序本地数据的指针。该变量不被 tty 核心修改。

A pointer that the tty_driver can use to store data local to the tty driver. This variable is not modified by the tty core.

快速参考

Quick Reference

本节为本章介绍的概念提供参考。它还解释了 tty 驱动程序需要包含的每个头文件的作用。不过,tty_driver 和 tty_device 结构中的字段列表此处不再重复。

This section provides a reference for the concepts introduced in this chapter. It also explains the role of each header file that a tty driver needs to include. The lists of fields in the tty_driver and tty_device structures, however, are not repeated here.

#include <linux/tty_driver.h>
#include <linux/tty_driver.h>

包含 struct tty_driver 的定义并声明该结构中使用的一些不同标志的头文件。

Header file that contains the definition of struct tty_driver and declares some of the different flags used in this structure.

#include <linux/tty.h>
#include <linux/tty.h>

包含 struct tty_struct 的定义以及许多可以轻松访问 struct termios 各个字段值的宏的头文件。它还包含 tty 驱动程序核心的函数声明。

Header file that contains the definition of struct tty_struct and a number of different macros to access the individual values of the struct termios fields easily. It also contains the function declarations of the tty driver core.

#include <linux/tty_flip.h>
#include <linux/tty_flip.h>

包含一些 tty 翻转缓冲区内联函数的头文件,可以更轻松地操作翻转缓冲区结构。

Header file that contains some tty flip buffer inline functions that make it easier to manipulate the flip buffer structures.

#include <asm/termios.h>
#include <asm/termios.h>

包含针对内核所构建的特定硬件平台的 struct termio 定义的头文件。

Header file that contains the definition of struct termio for the specific hardware platform the kernel is built for.

struct tty_driver *alloc_tty_driver(int lines);
struct tty_driver *alloc_tty_driver(int lines);

创建一个 struct tty_driver 的函数,该结构稍后可以传递给 tty_register_driver 和 tty_unregister_driver 函数。

Function that creates a struct tty_driver that can be later passed to the tty_register_driver and tty_unregister_driver functions.

void put_tty_driver(struct tty_driver *driver);
void put_tty_driver(struct tty_driver *driver);

清理尚未成功注册到 tty 核心的 struct tty_driver 结构的函数。

Function that cleans up a struct tty_driver structure that has not been successfully registered with the tty core.

void tty_set_operations(struct tty_driver *driver, struct tty_operations *op);
void tty_set_operations(struct tty_driver *driver, struct tty_operations *op);

初始化 struct tty_driver 的函数回调的函数。必须在调用 tty_register_driver 之前调用此函数。

Function that initializes the function callbacks of a struct tty_driver. This is necessary to call before tty_register_driver can be called.

int tty_register_driver(struct tty_driver *driver);

int tty_unregister_driver(struct tty_driver *driver);
int tty_register_driver(struct tty_driver *driver);

int tty_unregister_driver(struct tty_driver *driver);

从 tty 核心注册和取消注册 tty 驱动程序的函数。

Functions that register and unregister a tty driver from the tty core.

void tty_register_device(struct tty_driver *driver, unsigned minor, struct device *device);

void tty_unregister_device(struct tty_driver *driver, unsigned minor);
void tty_register_device(struct tty_driver *driver, unsigned minor, struct device *device);

void tty_unregister_device(struct tty_driver *driver, unsigned minor);

向 tty 核心注册和注销单个 tty 设备的函数。

Functions that register and unregister a single tty device with the tty core.

void tty_insert_flip_char(struct tty_struct *tty, unsigned char ch,

char flag);
void tty_insert_flip_char(struct tty_struct *tty, unsigned char ch,

char flag);

将字符插入 tty 设备的翻转缓冲区以供用户读取的函数。

Function that inserts characters into the tty device's flip buffer to be read by a user.

TTY_NORMAL

TTY_BREAK

TTY_FRAME

TTY_PARITY

TTY_OVERRUN
TTY_NORMAL

TTY_BREAK

TTY_FRAME

TTY_PARITY

TTY_OVERRUN

tty_insert_flip_char 函数中使用的 flag 参数的不同取值。

Different values for the flag parameter used in the tty_insert_flip_char function.

int tty_get_baud_rate(struct tty_struct *tty);
int tty_get_baud_rate(struct tty_struct *tty);

获取当前为特定 tty 设备设置的波特率的函数。

Function that gets the baud rate currently set for the specific tty device.

void tty_flip_buffer_push(struct tty_struct *tty);
void tty_flip_buffer_push(struct tty_struct *tty);

将当前翻转缓冲区中的数据推送给用户的函数。

Function that pushes the data in the current flip buffer to the user.

tty_std_termios
tty_std_termios

使用一组通用的默认行设置初始化 termios 结构的变量。

Variable that initializes a termios structure with a common set of default line settings.
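下面是一个把上述 tty 核心调用串起来的最小驱动程序骨架草图(仅作示意,针对本书覆盖的 2.6 系列内核;其中 ex_tty、EX_TTY_MINORS、ex_ops 等名称均为本示例假设的标识符,回调本身从略)。

A minimal sketch tying the tty-core calls above together (illustration only, against the 2.6-series kernels this book covers; names such as ex_tty, EX_TTY_MINORS, and ex_ops are identifiers assumed for this example, with the callbacks themselves omitted):

```c
/* 一个最小的 tty 驱动骨架(草图)。A minimal tty driver skeleton (sketch).
 * ex_tty、EX_TTY_MINORS、ex_ops 等名称均为示例假设;
 * 真正的驱动还必须在 ex_ops 中填充 open/close/write 等回调。 */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/tty.h>
#include <linux/tty_driver.h>
#include <linux/tty_flip.h>

#define EX_TTY_MINORS 4

static struct tty_operations ex_ops;        /* 回调在此填充(示例中省略) */
static struct tty_driver *ex_driver;

static int __init ex_init(void)
{
    int ret;

    ex_driver = alloc_tty_driver(EX_TTY_MINORS);
    if (!ex_driver)
        return -ENOMEM;

    ex_driver->owner        = THIS_MODULE;
    ex_driver->driver_name  = "ex_tty";
    ex_driver->name         = "ttyEX";
    ex_driver->type         = TTY_DRIVER_TYPE_SERIAL;
    ex_driver->subtype      = SERIAL_TYPE_NORMAL;
    ex_driver->flags        = TTY_DRIVER_REAL_RAW;
    ex_driver->init_termios = tty_std_termios;   /* 通用默认行设置 */

    tty_set_operations(ex_driver, &ex_ops);      /* 必须先于注册调用 */

    ret = tty_register_driver(ex_driver);
    if (ret) {
        put_tty_driver(ex_driver);  /* 注册失败:清理尚未注册的结构 */
        return ret;
    }
    return 0;
}

static void __exit ex_exit(void)
{
    tty_unregister_driver(ex_driver);
}

/* 收到字符(例如在中断处理程序中)后,典型的上报路径:
 * the typical path for handing received characters to a reader. */
static void ex_receive_char(struct tty_struct *tty, unsigned char ch)
{
    tty_insert_flip_char(tty, ch, TTY_NORMAL);   /* 放入翻转缓冲区 */
    tty_flip_buffer_push(tty);                   /* 推送给读取该 tty 的用户 */
}

module_init(ex_init);
module_exit(ex_exit);
MODULE_LICENSE("GPL");
```

注册成功后,tty 核心会依据 name 字段导出 ttyEX0 等设备;每收到一批字符,就按 ex_receive_char 所示把它们经翻转缓冲区推送给用户。Once registered, the tty core exports devices such as ttyEX0 based on the name field; received characters travel to the user through the flip buffer as shown in ex_receive_char.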

第 19 章参考书目

Chapter 19. Bibliography

本书中的大部分信息都是从内核源代码中提取的,这是有关 Linux 内核的最好文档。

Most of the information in this book has been extracted from the kernel sources, which are the best documentation about the Linux kernel.

内核源代码可以从全球数百个 FTP 站点检索,因此我们不会在这里列出。

Kernel sources can be retrieved from hundreds of FTP sites around the world, so we won't list them here.

最好通过查看补丁来检查版本依赖性,这些补丁可以从获取整个源代码的同一位置获得。名为 repatch 的程序可能会帮助您检查单个文件在不同内核补丁中的修改情况;它可以在 O'Reilly FTP 站点上提供的源文件中找到。

Version dependencies are best checked by looking at the patches, which are available from the same places where you get the whole source. The program called repatch might help you in checking how a single file has been modified throughout the different kernel patches; it is available in the source files provided on the O'Reilly FTP site.

图书

Books

虽然书店里充斥着技术书籍,但与 Linux 内核编程直接相关的书籍却少得惊人。以下是我们书架上精选的书籍。

While the bookstores are full of technical books, there are surprisingly few that are directly relevant to Linux kernel programming. Here is a selection of books found on our shelves.

Linux内核

Linux Kernel

博韦 (Bovet)、丹尼尔·P. 和马可·塞萨蒂 (Marco Cesati)。了解 Linux 内核,第二版。加利福尼亚州塞巴斯托波尔:O'Reilly & Associates, Inc. 2003。
Bovet, Daniel P. and Marco Cesati. Understanding the Linux Kernel, Second Edition. Sebastopol, CA: O'Reilly & Associates, Inc. 2003.

本书非常详细地介绍了Linux内核的设计和实现。它更倾向于提供对所使用算法的理解,而不是记录内核 API。本书涵盖了 2.4 内核,但仍然包含大量有用的信息。

This book covers the design and implementation of the Linux kernel in great detail. It is more oriented toward providing an understanding of the algorithms used than documenting the kernel API. This book covers the 2.4 kernel but still contains a great deal of useful information.

戈尔曼、梅尔. 了解 Linux 虚拟内存管理器。新泽西州上萨德尔河:Prentice Hall PTR,2004 年。
Gorman, Mel. Understanding the Linux Virtual Memory Manager. Upper Saddle River, NJ: Prentice Hall PTR, 2004.

想要了解更多关于Linux虚拟内存子系统的开发人员不妨看看这本书。它以 2.4 内核为中心,但也包含 2.6 信息。

Developers wanting to know more about the Linux virtual memory subsystem may wish to have a look at this book. It is centered around the 2.4 kernel but contains 2.6 information as well.

洛夫 (Love)、罗伯特。Linux 内核开发。印第安纳波利斯:萨姆斯出版社,2004 年。
Love, Robert. Linux Kernel Development. Indianapolis: Sams Publishing, 2004.

本书涵盖了广泛的 Linux 内核编程。每个 Linux 黑客的书架上都应该有一本参考书。

This book covers Linux kernel programming with a broad scope. It is a reference that should be on every Linux hacker's bookshelf.

亚格莫尔、卡里姆. 构建嵌入式 Linux 系统。加利福尼亚州塞巴斯托波尔:O'Reilly & Associates, Inc. 2003。
Yaghmour, Karim. Building Embedded Linux Systems. Sebastopol, CA: O'Reilly & Associates, Inc. 2003.

本书对于那些为嵌入式系统编写 Linux 代码的人很有用。

This book will be useful to those writing Linux code for embedded systems.

Unix 设计和内部结构

Unix Design and Internals

巴赫、莫里斯. Unix操作系统的设计。新泽西州上萨德尔河:Prentice Hall,1987。
Bach, Maurice. The Design of the Unix Operating System. Upper Saddle River, NJ: Prentice Hall, 1987.

尽管这本书已经很老了,但它涵盖了与 Unix 实现相关的所有问题。这是 Linus 第一个 Linux 版本的主要灵感来源。

Though quite old, this book covers all the issues related to Unix implementations. It was the main source of inspiration for Linus in the first Linux version.

史蒂文斯、理查德. UNIX 环境中的高级编程。波士顿:艾迪生韦斯利,1992。
Stevens, Richard. Advanced Programming in the UNIX Environment. Boston: Addison-Wesley, 1992.

本文描述了 Unix 系统调用的每个细节,这是在设备方法中实现高级功能时的好伴侣。

Every detail of Unix system calls is described herein, which is a good companion when implementing advanced features in the device methods.

史蒂文斯、理查德. Unix 网络编程。新泽西州上萨德尔河:Prentice Hall PTR,1990。
Stevens, Richard. Unix Network Programming. Upper Saddle River, NJ: Prentice Hall PTR, 1990.

也许是关于 Unix 网络编程 API 的权威书籍。

Perhaps the definitive book on the Unix network programming API.

网站

Web Sites

在快速发展的 Linux 内核开发世界中,最新信息通常可以在网上找到。以下是我们在撰写本文时选择的最佳网站:

In the fast-moving world of Linux kernel development, the most current information is often found online. The following is our selection of the best web sites as of this writing:

http://www.kernel.org

ftp://ftp.kernel.org
http://www.kernel.org

ftp://ftp.kernel.org

该站点是 Linux 内核开发的主页。您将找到最新的内核版本和相关信息。请注意,FTP 站点在世界各地都有镜像,因此您很可能会在您附近找到镜像。

This site is the home of Linux kernel development. You'll find the latest kernel release and related information. Note that the FTP site is mirrored throughout the world, so you'll most likely find a mirror near you.

http://www.bkbits.net
http://www.bkbits.net

该站点托管许多著名内核开发人员使用的源代码存储库。特别是,名为“linus”的项目包含由 Linus Torvalds 维护的主线内核。如果您对应用于内核的最新补丁感到好奇,可以在这里查看。

This site hosts the source repositories used by a number of prominent kernel developers. In particular, the project called "linus" contains the mainline kernel as maintained by Linus Torvalds. If you are curious about the very latest patches which have been applied to the kernel, this is the place to look.

http://www.tldp.org
http://www.tldp.org

Linux 文档项目包含许多有趣的文档,称为“HOWTO”;其中一些技术性很强,涵盖了与内核相关的主题。

The Linux Documentation Project carries a lot of interesting documents called "HOWTOs"; some of them are pretty technical and cover kernel-related topics.

http://www.linux.it/kerneldocs
http://www.linux.it/kerneldocs

此页面包含 Alessandro Rubini 撰写的许多面向内核的杂志文章。其中一些可以追溯到几年前,但通常仍然适用;其中一些是意大利语,但通常也有英语翻译。

This page contains many kernel-oriented magazine articles written by Alessandro Rubini. Some of them date back a few years, but they usually still apply; some of them are in Italian, but usually an English translation is available as well.

http://lwn.net
http://lwn.net

冒着看似自私的风险,我们指出这个新闻网站,除其他外,还提供定期的内核开发报道和 API 更改信息。

At the risk of seeming self-serving, we point out this news site that, among other things, offers regular kernel development coverage and API change information.

http://www.kerneltraffic.org
http://www.kerneltraffic.org

Kernel Traffic 是一个受欢迎的网站,每周提供 Linux 内核开发邮件列表的讨论摘要。

Kernel Traffic is a popular site that provides weekly summaries of discussions on the Linux kernel development mailing list.

http://www.kerneltrap.org/
http://www.kerneltrap.org/

该站点偶尔会报道 Linux 和 BSD 内核社区中有趣的进展。

This site picks up occasional interesting developments in the Linux and BSD kernel communities.

http://www.kernelnewbies.org
http://www.kernelnewbies.org

该网站面向新内核开发人员。对于那些寻求即时帮助的人来说,这里有入门信息、常见问题解答和相关的 IRC 频道。

This site is oriented toward new kernel developers. There is beginning information, a FAQ, and an associated IRC channel for those looking for immediate assistance.

http://janitor.kernelnewbies.org/
http://janitor.kernelnewbies.org/

Linux Kernel Janitor 项目是新内核程序员可以学习如何参与开发工作的地方。这里描述了需要在整个内核中完成的各种小而通常简单的任务。有一个邮件列表可以帮助新开发人员将这些更改添加到主内核树中。对于任何想要开始进行 Linux 内核开发但不知道从哪里开始的人来说,这是一个很好的地方。

The Linux Kernel Janitor project is the place where new kernel programmers can learn how to join in the development effort. A wide range of small, generally simple tasks that need to be done all over the kernel are described here. There is a mailing list that helps new developers get these changes into the main kernel tree. This is a great place for anyone wanting to start doing Linux kernel development but not knowing where to begin.

索引

Index

关于数字索引的说明

A note on the digital index

索引条目中的链接显示为该条目所在的部分标题。由于某些部分具有多个索引标记,因此一个条目具有多个指向同一部分的链接并不罕见。单击任何链接将直接转到文本中出现该标记的位置。

A link in an index entry is displayed as the section title in which that entry appears. Because some sections have multiple index markers, it is not unusual for an entry to have several links to the same section. Clicking on any link will take you directly to the place in the text in which the marker appears.

A

A

抽象(硬件),硬件抽象
abstractions (hardware), Hardware Abstractions
访问、设备驱动程序的角色,设备和模块的类,设备和模块的类,主编号和次编号-主编号的动态分配,主编号的动态分配,并发性及其管理,模糊规则,seqlocks,功能和受限操作,设备文件上的访问控制,一次限制单个用户的访问,阻止打开作为 EBUSY 的替代方案,内存区域,操作 I/O 端口,从用户空间访问 I/O 端口,使用 I/O 内存,访问 I/O 内存,为 I/O 内存重用 short,1 MB 以下的 ISA 内存,快速参考,快速参考,PCI 寻址,访问配置空间,访问 I/O 和内存空间,嵌入 kobject,内存映射和 DMA,内存映射和结构页
access, The Role of the Device Driver, Classes of Devices and Modules, Classes of Devices and Modules, Major and Minor Numbers–Dynamic Allocation of Major Numbers, Dynamic Allocation of Major Numbers, Concurrency and Its Management, Ambiguous Rules, seqlocks, Capabilities and Restricted Operations, Access Control on a Device File, Restricting Access to a Single User at a Time, Blocking open as an Alternative to EBUSY, Memory zones, Manipulating I/O ports, I/O Port Access from User Space, Using I/O Memory, Accessing I/O Memory, Reusing short for I/O Memory, ISA Memory Below 1 MB, Quick Reference, Quick Reference, PCI Addressing, Accessing the Configuration Space, Accessing the I/O and Memory Spaces, Embedding kobjects, Memory Mapping and DMA, The Memory Map and Struct Page
阻止打开请求,阻止打开作为 EBUSY 的替代方案
字符 (char) 驱动程序、设备和模块的类,主编号和次编号-主编号的动态分配
设备文件,设备文件上的访问控制
DMA、内存映射和 DMA(请参阅 DMA)
对于驱动程序,主要号码的动态分配
I/O 内存、使用 I/O 内存,访问 I/O 内存,为 I/O 内存重用 short
接口、设备类别和模块
ISA内存、1MB以下的ISA内存
kobjects,嵌入kobjects
锁定、不明确的规则
管理、并发及其管理
NUMA 系统、内存区域内存映射和结构页
PCI、PCI 寻址,访问配置空间,访问 I/O 和内存空间
配置空间,访问配置空间
I/O 和内存空间,访问 I/O 和内存空间
策略,设备驱动程序的角色
端口、操作 I/O 端口,从用户空间访问 I/O 端口,快速参考
不同尺寸,操作 I/O 端口
从用户空间,从用户空间访问 I/O 端口
访问限制、功能和受限操作,一次限制单个用户的访问
序列锁,序列锁
未对齐数据,快速参考
blocking open requests, Blocking open as an Alternative to EBUSY
character (char) drivers, Classes of Devices and Modules, Major and Minor Numbers–Dynamic Allocation of Major Numbers
to device files, Access Control on a Device File
DMA, Memory Mapping and DMA (see DMA)
to drivers, Dynamic Allocation of Major Numbers
I/O memory, Using I/O Memory, Accessing I/O Memory, Reusing short for I/O Memory
interfaces, Classes of Devices and Modules
ISA memory, ISA Memory Below 1 MB
kobjects, Embedding kobjects
locking, Ambiguous Rules
management, Concurrency and Its Management
NUMA systems, Memory zones, The Memory Map and Struct Page
PCI, PCI Addressing, Accessing the Configuration Space, Accessing the I/O and Memory Spaces
configuration space, Accessing the Configuration Space
I/O and memory spaces, Accessing the I/O and Memory Spaces
policies, The Role of the Device Driver
ports, Manipulating I/O ports, I/O Port Access from User Space, Quick Reference
different sizes, Manipulating I/O ports
from user space, I/O Port Access from User Space
restriction of, Capabilities and Restricted Operations, Restricting Access to a Single User at a Time
seqlocks, seqlocks
unaligned data, Quick Reference
access_ok 函数,使用 ioctl 参数
access_ok function, Using the ioctl Argument
ACTION 变量,/sbin/hotplug 实用程序
ACTION variable, The /sbin/hotplug Utility
添加、信号量和互斥体,添加设备,添加驱动程序,添加 VMA 操作
adding, Semaphores and Mutexes, Add a Device, Add a Driver, Adding VMA Operations
设备,添加设备
驱动程序,添加驱动程序
锁定、信号量和互斥体
VMA,添加 VMA 操作
devices, Add a Device
drivers, Add a Driver
locking, Semaphores and Mutexes
VMAs, Adding VMA Operations
地址、分割内核,PCI 寻址,访问 I/O 和内存空间,地址类型,重新映射内核虚拟地址,总线地址,总线地址,DMA 映射,PCI 双地址循环映射,初始化每个设备,接口信息,打开和关闭,MAC 地址解析-非以太网标头,MAC 地址解析
addresses, Splitting the Kernel, PCI Addressing, Accessing the I/O and Memory Spaces, Address Types, Remapping Kernel Virtual Addresses, Bus Addresses, Bus Addresses, DMA mappings, PCI double-address cycle mappings, Initializing Each Device, Interface Information, Opening and Closing, MAC Address Resolution–Non-Ethernet Headers, MAC Address Resolution
反弹缓冲区、DMA 映射
巴士,巴士地址
硬件、接口信息,打开和关闭
MAC、初始化每个设备,MAC 地址解析-非以太网标头
PCI、PCI 寻址,PCI 双地址周期映射
重新映射,重新映射内核虚拟地址
解决方案(网络管理),拆分内核
解析,MAC地址解析
空间通用 I/O,访问 I/O 和内存空间
类型、地址类型
虚拟(转换),总线地址
bounce buffers, DMA mappings
buses, Bus Addresses
hardware, Interface Information, Opening and Closing
MAC, Initializing Each Device, MAC Address Resolution–Non-Ethernet Headers
PCI, PCI Addressing, PCI double-address cycle mappings
remapping, Remapping Kernel Virtual Addresses
resolution (network management), Splitting the Kernel
resolving, MAC Address Resolution
spaces generic I/O, Accessing the I/O and Memory Spaces
types, Address Types
virtual (conversion), Bus Addresses
aio_fsync操作,异步I/O
aio_fsync operation, Asynchronous I/O
算法(无锁)、无锁算法
algorithms (lock-free), Lock-Free Algorithms
对齐、数据对齐,数据对齐,快速参考
alignment, Data Alignment, Data Alignment, Quick Reference
数据、数据对齐,数据对齐
未对齐数据访问,快速参考
of data, Data Alignment, Data Alignment
unaligned data access, Quick Reference
分配、get_free_page 和 Friends
allocating, get_free_page and Friends
内存、get_free_page 和朋友
按页面、get_free_page 和好友
memory, get_free_page and Friends
by page, get_free_page and Friends
分配、分配和释放设备编号,主编号的动态分配,字符设备注册-旧方法,scull 的内存使用-scull 的内存使用,kmalloc 的真实故事-大小参数,标志参数,后备缓存-alloc_pages 接口,后备缓存,get_free_page 和朋友,vmalloc 和朋友-使用虚拟地址的 scull:scullv,每 CPU 变量-每 CPU 变量,获取大缓冲区,快速参考,快速参考,快速参考,快速参考,I/O 端口分配,I/O 内存分配和映射,I/O 内存分配和映射,快速参考,快速参考,提交和控制 Urb,分配 DMA 缓冲区,gendisk 结构,sbull 中的初始化,设备注册,数据包接收,数据包接收,作用于套接字缓冲区的函数,作用于套接字缓冲区的函数
allocation, Allocating and Freeing Device Numbers, Dynamic Allocation of Major Numbers, Char Device Registration–The Older Way, scull's Memory Usage–scull's Memory Usage, The Real Story of kmalloc–The Size Argument, The Flags Argument, Lookaside Caches–The alloc_pages Interface, Lookaside Caches, get_free_page and Friends, vmalloc and Friends–A scull Using Virtual Addresses: scullv, Per-CPU Variables–Per-CPU Variables, Obtaining Large Buffers, Quick Reference, Quick Reference, Quick Reference, Quick Reference, I/O Port Allocation, I/O Memory Allocation and Mapping, I/O Memory Allocation and Mapping, Quick Reference, Quick Reference, Submitting and Controlling a Urb, Allocating the DMA Buffer, The gendisk structure, Initialization in sbull, Device Registration, Packet Reception, Packet Reception, Functions Acting on Socket Buffers, Functions Acting on Socket Buffers
块驱动程序、sbull 中的初始化
缓冲区、作用于套接字缓冲区的函数
设备编号、分配和释放设备编号
DMA 缓冲区、分配 DMA 缓冲区
主编号的动态分配、主编号的动态分配
gendisk 结构、gendisk 结构
I/O 端口、I/O 端口分配
内存、scull 的内存使用-scull 的内存使用,kmalloc 的真实故事-大小参数,标志参数,后备缓存-alloc_pages 接口,后备缓存,vmalloc 和朋友-使用虚拟地址的 scull:scullv,每 CPU 变量-每 CPU 变量,获取大缓冲区,快速参考,快速参考,快速参考,I/O 内存分配和映射,快速参考
启动时间、获取大缓冲区,快速参考
标志、标志参数,后备缓存,快速参考
I/O、I/O 内存分配和映射,快速参考
kmalloc 分配引擎、kmalloc 的真实故事-大小参数
后备缓存、后备缓存-alloc_pages 接口,快速参考
每 CPU 变量、每 CPU 变量-每 CPU 变量
vmalloc 分配函数、vmalloc 和朋友-使用虚拟地址的 scull:scullv
面向页面的函数、get_free_page 和朋友,快速参考
snull 驱动程序、设备注册
套接字缓冲区、数据包接收,数据包接收,作用于套接字缓冲区的函数
结构(注册)、字符设备注册-旧方法
urb、提交和控制 Urb
of block drivers, Initialization in sbull
of buffers, Functions Acting on Socket Buffers
of device numbers, Allocating and Freeing Device Numbers
of DMA buffers, Allocating the DMA Buffer
dynamic allocation of major numbers, Dynamic Allocation of Major Numbers
of gendisk structures, The gendisk structure
of I/O ports, I/O Port Allocation
of memory, scull's Memory Usage–scull's Memory Usage, The Real Story of kmalloc–The Size Argument, The Flags Argument, Lookaside Caches–The alloc_pages Interface, Lookaside Caches, vmalloc and Friends–A scull Using Virtual Addresses: scullv, Per-CPU Variables–Per-CPU Variables, Obtaining Large Buffers, Quick Reference, Quick Reference, Quick Reference, I/O Memory Allocation and Mapping, Quick Reference
boot time, Obtaining Large Buffers, Quick Reference
flags, The Flags Argument, Lookaside Caches, Quick Reference
I/O, I/O Memory Allocation and Mapping, Quick Reference
kmalloc allocation engine, The Real Story of kmalloc–The Size Argument
lookaside caches, Lookaside Caches–The alloc_pages Interface, Quick Reference
per-CPU variables, Per-CPU Variables–Per-CPU Variables
vmalloc allocation function, vmalloc and Friends–A scull Using Virtual Addresses: scullv
page-oriented functions, get_free_page and Friends, Quick Reference
of snull drivers, Device Registration
of socket buffers, Packet Reception, Packet Reception, Functions Acting on Socket Buffers
structures (registration), Char Device Registration–The Older Way
of urbs, Submitting and Controlling a Urb
alloc_netdev函数,初始化每个设备
alloc_netdev function, Initializing Each Device
alloc_pages 接口, alloc_pages 接口
alloc_pages interface, The alloc_pages Interface
alloc_skb 函数,作用于套接字缓冲区的函数
alloc_skb function, Functions Acting on Socket Buffers
alloc_tty_driver 函数,一个小型 TTY 驱动程序
alloc_tty_driver function, A Small TTY Driver
Alpha 架构、移植和平台依赖性
Alpha architecture, porting and, Platform Dependencies
锁定的替代方案、锁定的替代方案-读取-复制-更新
alternatives to locking, Alternatives to Locking–Read-Copy-Update
API(应用程序编程接口)、Spinlock API 简介,Timer API
API (application programming interface), Introduction to the Spinlock API, The Timer API
自旋锁,自旋锁 API 简介
定时器,定时器 API
spinlocks, Introduction to the Spinlock API
timers, The Timer API
应用程序编程接口,Spinlock API 简介(请参阅 API)
application programming interface, Introduction to the Spinlock API (see API)
应用程序、与内核的比较、内核模块与应用程序
applications, comparisons to kernels, Kernel Modules Versus Applications
体系结构、平台依赖性,平台依赖性,平台依赖性,平台依赖性,x86 上中断处理的内部结构,PCI 接口-硬件抽象,MCA,EISA,VLB,SBus,NuBus,S/390 和 zSeries,S/390 和 zSeries
architecture, Platform Dependencies, Platform Dependencies, Platform Dependencies, Platform Dependencies, The internals of interrupt handling on the x86, The PCI Interface–Hardware Abstractions, MCA, EISA, VLB, SBus, NuBus, S/390 and zSeries, S/390 and zSeries
EISA、EISA
M68k(移植和),平台依赖性
MCA、MCA
努巴士,努巴士
PCI、PCI 接口-硬件抽象
PowerPC(移植和)、平台依赖性
S/390、S/390 和 z 系列
SBus、SBus
SPARC,平台依赖性
Super-H,平台依赖性
VLB, VLB
x86(中断处理程序打开),x86 上中断处理的内部结构
z 系列、S/390 和 z 系列
EISA, EISA
M68k (porting and), Platform Dependencies
MCA, MCA
NuBus, NuBus
PCI, The PCI Interface–Hardware Abstractions
PowerPC (porting and), Platform Dependencies
S/390, S/390 and zSeries
SBus, SBus
SPARC, Platform Dependencies
Super-H, Platform Dependencies
VLB, VLB
x86 (interrupt handlers on), The internals of interrupt handling on the x86
zSeries, S/390 and zSeries
参数、seq_file 接口使用 ioctl 参数标志参数大小参数旁视缓存处理程序参数和返回值
arguments, The seq_file interface, Using the ioctl Argument, The Flags Argument, The Size Argument, Lookaside Caches, Handler Arguments and Return Value
缓存、后备缓存
旗帜,旗帜争论
中断处理程序、处理程序参数和返回值
ioctl 方法,使用 ioctl 参数
kmalloc 大小,大小参数
sfile,seq_file 接口
cache, Lookaside Caches
flags, The Flags Argument
interrupt handlers, Handler Arguments and Return Value
ioctl method, Using the ioctl Argument
kmalloc size, The Size Argument
sfile, The seq_file interface
ARM 架构、移植和平台依赖性
ARM architecture, porting and, Platform Dependencies
ARP(地址解析协议)、初始化每个设备,初始化每个设备,接口信息,在以太网中使用 ARP,覆盖 ARP
ARP (Address Resolution Protocol), Initializing Each Device, Initializing Each Device, Interface Information, Using ARP with Ethernet, Overriding ARP
以太网以及通过以太网使用 ARP
IFF_NOARP 标志、初始化每个设备,接口信息
覆盖,覆盖 ARP
Ethernet and, Using ARP with Ethernet
IFF_NOARP flag and, Initializing Each Device, Interface Information
overriding, Overriding ARP
数组、模块参数,scull 的内存使用,内存映射和结构页,sbull 中的初始化,bio 结构
arrays, Module Parameters, scull's Memory Usage, The Memory Map and Struct Page, Initialization in sbull, The bio structure
bi_io_vec、bio 结构
块驱动程序,在 sbull 中初始化
内存映射、内存映射和结构页
参数(声明),模块参数
量子集(内存),scull 的内存使用情况
bi_io_vec, The bio structure
block drivers, Initialization in sbull
memory maps, The Memory Map and Struct Page
parameters (declaration of), Module Parameters
quantum sets (memory), scull's Memory Usage
asm 目录,内核模块与应用程序
asm directory, Kernel Modules Versus Applications
分配、模块参数-模块参数,主编号的动态分配,分配 IP 编号,打开和关闭
assignment, Module Parameters–Module Parameters, Dynamic Allocation of Major Numbers, Assigning IP Numbers, Opening and Closing
动态分配主号码, Dynamic Allocation of Major Numbers
硬件地址、打开和关闭
IP 号码、分配 IP 号码
参数值、模块参数-模块参数
dynamic allocation of major numbers, Dynamic Allocation of Major Numbers
of hardware addresses, Opening and Closing
of IP numbers, Assigning IP Numbers
of parameter values, Module Parameters–Module Parameters
异步 DMA,DMA 数据传输概述
asynchronous DMA, Overview of a DMA Data Transfer
异步 I/O、异步 I/O-异步 I/O 示例
asynchronous I/O, Asynchronous I/O–An asynchronous I/O example
异步通知、异步通知-驱动程序的观点
asynchronous notification, Asynchronous Notification–The Driver's Point of View
异步运行定时器、内核定时器
asynchronous running of timers, Kernel Timers
异步测试程序,异步通知
asynctest program, Asynchronous Notification
原子上下文(自旋锁),自旋锁和原子上下文
atomic context (spinlocks), Spinlocks and Atomic Context
原子变量,原子变量
atomic variables, Atomic Variables
atomic_add操作,原子变量
atomic_add operation, Atomic Variables
atomic_dec 操作,原子变量
atomic_dec operation, Atomic Variables
atomic_dec_and_test 操作,原子变量
atomic_dec_and_test operation, Atomic Variables
atomic_inc操作,原子变量
atomic_inc operation, Atomic Variables
atomic_inc_and_test 操作,原子变量
atomic_inc_and_test operation, Atomic Variables
atomic_read操作,原子变量
atomic_read operation, Atomic Variables
atomic_set操作,原子变量
atomic_set operation, Atomic Variables
atomic_sub操作,原子变量
atomic_sub operation, Atomic Variables
atomic_sub_and_test 操作,原子变量
atomic_sub_and_test operation, Atomic Variables
atomic_t 计数字段(内存)、内存映射和结构页
atomic_t count field (memory), The Memory Map and Struct Page
属性,默认属性,非默认属性,非默认属性,二进制属性,总线属性,总线属性,设备属性,驱动程序结构嵌入,工作原理,工作原理,工作原理
attributes, Default Attributes, Nondefault Attributes, Nondefault Attributes, Binary Attributes, Bus attributes, Bus attributes, Device attributes, Driver structure embedding, How It Works, How It Works, How It Works
二进制(kobjects),二进制属性
总线、总线属性
数据(固件),工作原理
默认(kobjects),默认属性
删除、非默认属性,总线属性
设备、设备属性,工作原理
驱动程序、驱动程序结构嵌入
加载(固件),它是如何工作的
非默认(kobjects),非默认属性
binary (kobjects), Binary Attributes
buses, Bus attributes
data (firmware), How It Works
default (kobjects), Default Attributes
deleting, Nondefault Attributes, Bus attributes
devices, Device attributes, How It Works
drivers, Driver structure embedding
loading (firmware), How It Works
nondefault (kobjects), Nondefault Attributes
授权、安全问题
authorization, Security Issues
自动检测,自动检测IRQ号
autodetection, Autodetecting the IRQ Number
自动IRQ号检测,AutoDetecting the IRQ Number
automatic IRQ number detection, Autodetecting the IRQ Number

B

向后转换 kobject 指针,嵌入 kobject
back-casting kobject pointers, Embedding kobjects
屏障、I/O 寄存器和传统内存,I/O 寄存器和传统内存,快速参考,屏障请求
barriers, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory, Quick Reference, Barrier requests
存储器、I/O 寄存器和传统存储器,I/O 寄存器和传统存储器,快速参考
请求、障碍请求
memory, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory, Quick Reference
requests, Barrier requests
基本模块参数,示例驱动程序
base module parameter, A Sample Driver
波特率(tty 驱动程序)、set_termios
baud rates (tty drivers), set_termios
BCD(二进制编码的十进制)形式,驱动程序支持哪些设备?
BCD (binary-coded decimal) forms, What Devices Does the Driver Support?
bEndpointAddress 字段 (USB),端点
bEndpointAddress field (USB), Endpoints
大端字节顺序、字节顺序
big-endian byte order, Byte Order
二进制属性(kobjects),二进制属性
binary Attributes (kobjects), Binary Attributes
二进制编码十进制 (BCD) 形式,驱动程序支持哪些设备?
binary-coded decimal (BCD) forms, What Devices Does the Driver Support?
bInterval 字段 (USB)、端点
bInterval field (USB), Endpoints
bin_attribute 结构,二进制属性
bin_attribute structure, Binary Attributes
bio 结构、bio 结构,使用 bio
bio structure, The bio structure, Working with bios
位字段(ioctl 命令)、选择 ioctl 命令,快速参考
bitfields (ioctl commands), Choosing the ioctl Commands, Quick Reference
位、位操作,并行端口概述,实现处理程序
bits, Bit Operations, An Overview of the Parallel Port, Implementing a Handler
清除,实现处理程序
运算、位运算
规格,并行端口概述
clearing, Implementing a Handler
operations, Bit Operations
specifications, An Overview of the Parallel Port
bi_io_vec数组,bio结构
bi_io_vec array, The bio structure
blkdev_dequeue_request 函数,排队函数
blkdev_dequeue_request function, Queueing functions
BLK_BOUNCE_HIGH 符号,队列控制函数
BLK_BOUNCE_HIGH symbol, Queue control functions
blk_cleanup_queue函数,队列创建和删除
blk_cleanup_queue function, Queue creation and deletion
blk_queue_hardsect_size 函数,关于扇区大小的注释
blk_queue_hardsect_size function, A Note on Sector Sizes
blk_queue_segment_boundary函数,队列控制函数
blk_queue_segment_boundary function, Queue control functions
块设备、设备类和模块
block devices, Classes of Devices and Modules
块驱动程序、注册-关于扇区大小的说明,块设备操作-ioctl 方法,请求处理-不使用请求队列,命令预准备,标记命令队列-标记命令队列,快速参考-快速参考
block drivers, Registration–A Note on Sector Sizes, The Block Device Operations–The ioctl Method, Request Processing–Doing without a request queue, Command Pre-Preparation, Tagged Command Queueing–Tagged Command Queueing, Quick Reference–Quick Reference
命令预准备、命令预准备
功能、快速参考-快速参考
操作、块设备操作-ioctl 方法
注册、注册-关于扇区大小的说明
请求处理、请求处理-不使用请求队列
TCQ、标记命令队列-标记命令队列
command pre-preparation, Command Pre-Preparation
functions, Quick Reference–Quick Reference
operations, The Block Device Operations–The ioctl Method
registration, Registration–A Note on Sector Sizes
request processing, Request Processing–Doing without a request queue
TCQ, Tagged Command Queueing–Tagged Command Queueing
阻塞、阻塞 I/O-测试 Scullpipe 驱动程序,阻塞和非阻塞操作,阻塞打开作为 EBUSY 的替代方案,阻塞打开作为 EBUSY 的替代方案,阻塞打开作为 EBUSY 的替代方案
blocking, Blocking I/O–Testing the Scullpipe Driver, Blocking and Nonblocking Operations, Blocking open as an Alternative to EBUSY, Blocking open as an Alternative to EBUSY, Blocking open as an Alternative to EBUSY
I/O、阻塞 I/O-测试 Scullpipe 驱动程序,阻塞打开作为 EBUSY 的替代方案
open 方法,阻塞 open 作为 EBUSY 的替代方案
操作、阻塞和非阻塞操作
释放方法,阻塞打开作为 EBUSY 的替代方案
I/O, Blocking I/O–Testing the Scullpipe Driver, Blocking open as an Alternative to EBUSY
open method, Blocking open as an Alternative to EBUSY
operations, Blocking and Nonblocking Operations
release method, Blocking open as an Alternative to EBUSY
block_fsync方法,刷新挂起的输出
block_fsync method, Flushing pending output
bmAttributes 字段 (USB)、端点
bmAttributes field (USB), Endpoints
BogoMips 值,短延迟
BogoMips value, Short Delays
启动时间(内存分配)、获取大缓冲区快速参考
boot time (memory allocation), Obtaining Large Buffers, Quick Reference
启动 (PCI)、启动时间
booting (PCI), Boot Time
下半部、上半部和下半部-工作队列,Tasklet
bottom halves, Top and Bottom Halves–Workqueues, Tasklets
中断处理程序、上半部和下半部-工作队列
tasklet、Tasklet
interrupt handlers, Top and Bottom Halves–Workqueues
tasklets and, Tasklets
反弹缓冲区、DMA 映射,设置流 DMA 映射,队列控制功能
bounce buffers, DMA mappings, Setting up streaming DMA mappings, Queue control functions
块驱动程序、队列控制函数
流 DMA 映射以及设置流 DMA 映射
block drivers, Queue control functions
streaming DMA mappings and, Setting up streaming DMA mappings
桥接器、PCI 寻址
bridges, PCI Addressing
BSS 段,虚拟内存区域
BSS segments, Virtual Memory Areas
缓冲区、安全问题,消息如何记录,消息如何记录,Oops 消息,无锁算法,阻塞和非阻塞操作,阻塞和非阻塞操作,获取大缓冲区,快速参考,写缓冲示例,struct urb,执行直接 I/O,DMA 数据传输概述,DMA 映射,设置流式 DMA 映射,设置流式 DMA 映射,设置流式 DMA 映射,PCI 双地址循环映射,队列控制函数,数据包接收,套接字缓冲区-作用于套接字缓冲区的函数,作用于套接字缓冲区的函数,作用于套接字缓冲区的函数,其他缓冲函数
buffers, Security Issues, How Messages Get Logged, How Messages Get Logged, Oops Messages, Lock-Free Algorithms, Blocking and Nonblocking Operations, Blocking and Nonblocking Operations, Obtaining Large Buffers, Quick Reference, A Write-Buffering Example, struct urb, Performing Direct I/O, Overview of a DMA Data Transfer, DMA mappings, Setting up streaming DMA mappings, Setting up streaming DMA mappings, Setting up streaming DMA mappings, PCI double-address cycle mappings, Queue control functions, Packet Reception, The Socket Buffers–Functions Acting on Socket Buffers, Functions Acting on Socket Buffers, Functions Acting on Socket Buffers, Other Buffering Functions
分配、作用于套接字缓冲区的函数
反弹、DMA 映射,设置流 DMA 映射,队列控制功能
块驱动程序、队列控制函数
流 DMA 映射以及设置流 DMA 映射
循环、如何记录消息,无锁算法
DMA(取消映射),设置流 DMA 映射
释放,作用于套接字缓冲区的函数
I/O、阻塞和非阻塞操作
大缓冲区(获取)、获取大缓冲区,快速参考
输出、阻塞和非阻塞操作
溢出错误、安全问题,Oops 消息
对于 printk 函数,如何记录消息
环 (DMA),DMA 数据传输概述
套接字、数据包接收,套接字缓冲区-作用于套接字缓冲区的函数
同步、PCI 双地址周期映射
传输,设置流 DMA 映射
tty 驱动程序、其他缓冲功能
USB,结构 urb
用户空间(直接 I/O),执行直接 I/O
写缓冲示例,写缓冲示例
allocation of, Functions Acting on Socket Buffers
bounce, DMA mappings, Setting up streaming DMA mappings, Queue control functions
block drivers, Queue control functions
streaming DMA mappings and, Setting up streaming DMA mappings
circular, How Messages Get Logged, Lock-Free Algorithms
DMA (unmapping), Setting up streaming DMA mappings
freeing, Functions Acting on Socket Buffers
I/O, Blocking and Nonblocking Operations
large (obtaining), Obtaining Large Buffers, Quick Reference
output, Blocking and Nonblocking Operations
overrun errors, Security Issues, Oops Messages
for printk function, How Messages Get Logged
ring (DMA), Overview of a DMA Data Transfer
sockets, Packet Reception, The Socket Buffers–Functions Acting on Socket Buffers
synchronization, PCI double-address cycle mappings
transfers, Setting up streaming DMA mappings
tty drivers, Other Buffering Functions
USB, struct urb
user space (direct I/O), Performing Direct I/O
write-buffering example, A Write-Buffering Example
批量端点 (USB)、端点
BULK endpoints (USB), Endpoints
批量 urbs (USB),批量 urbs
bulk urbs (USB), Bulk urbs
总线、PCI 驱动程序,PCI 寻址,USB 驱动程序,总线-总线属性,总线注册,总线方法,总线方法,迭代设备和驱动程序,总线属性,IEEE1394 (FireWire),总线、设备和驱动程序,地址类型,总线地址,DMA 映射
buses, PCI Drivers, PCI Addressing, USB Drivers, Buses–Bus attributes, Bus registration, Bus methods, Bus methods, Iterating over devices and drivers, Bus attributes, IEEE1394 (FireWire), Buses, Devices, and Drivers, Address Types, Bus Addresses, DMA mappings
地址、地址类型总线地址
属性、总线属性
函数、总线、设备和驱动程序
IEEE1394(火线)、IEEE1394(火线)
迭代,迭代设备和驱动程序
Linux 设备模型,总线-总线属性
匹配函数、总线方法
方法、总线方法
PCI、PCI 驱动程序PCI 寻址(请参阅 PCI)
寄存器、DMA 映射
登记,巴士登记
USB、USB 驱动程序(请参阅 USB)
addresses, Address Types, Bus Addresses
attributes, Bus attributes
functions, Buses, Devices, and Drivers
IEEE1394 (Firewire), IEEE1394 (FireWire)
iteration, Iterating over devices and drivers
Linux device model, Buses–Bus attributes
match function, Bus methods
methods, Bus methods
PCI, PCI Drivers, PCI Addressing (see PCI)
registers, DMA mappings
registration, Bus registration
USB, USB Drivers (see USB)
忙循环,忙等待
busy loops, Busy waiting
忙等待实现,忙等待
busy-waiting implementation, Busy waiting
bus_add_driver函数,添加驱动程序
bus_add_driver function, Add a Driver
BUS_ATTR 宏,总线属性
BUS_ATTR macro, Bus attributes
bus_attribute 类型,总线属性
bus_attribute type, Bus attributes
bus_for_each_dev 函数,迭代设备和驱动程序
bus_for_each_dev function, Iterating over devices and drivers
bus_register函数,总线注册
bus_register function, Bus registration
bus_type 结构,总线
bus_type structure, Buses
字节、字节顺序,快速参考,set_termios
bytes, Byte Order, Quick Reference, set_termios
CSIZE 位掩码,set_termios
顺序、字节顺序
订单、快速参考
CSIZE bitmask, set_termios
order, Byte Order
orders, Quick Reference

C

C

缓存、后备缓存-alloc_pages 接口,后备缓存,快速参考,I/O 寄存器和常规内存,I/O 寄存器和常规内存,使用 remap_pfn_range,DMA 映射
caches, Lookaside Caches–The alloc_pages Interface, Lookaside Caches, Quick Reference, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory, Using remap_pfn_range, DMA mappings
参数,后备缓存
一致性问题、DMA 映射
后备缓存、后备缓存-alloc_pages 接口,快速参考
故障排除、I/O 寄存器和常规内存I/O 寄存器和常规内存使用 remap_pfn_range
argument, Lookaside Caches
coherency issues, DMA mappings
lookaside, Lookaside Caches–The alloc_pages Interface, Quick Reference
troubleshooting, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory, Using remap_pfn_range
调用、当前进程观察调试ioctl单开设备I/O 寄存器和常规内存I/O 寄存器和常规内存使用 I/O 内存工作原理命令准备
calling, The Current Process, Debugging by Watching, ioctl, Single-Open Devices, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory, Using I/O Memory, How It Works, Command Pre-Preparation
当前进程,当前进程
固件,工作原理
ioctl 方法, ioctl
ioremap函数,使用I/O内存
内存屏障、I/O 寄存器和传统内存I/O 寄存器和传统内存
perror 调用,通过观察进行调试
准备功能,命令预准备
发布,单开设备
current process, The Current Process
firmware, How It Works
ioctl method, ioctl
ioremap function, Using I/O Memory
memory barriers, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory
perror calls, Debugging by Watching
preparation functions, Command Pre-Preparation
release, Single-Open Devices
cancellation of urbs, Canceling Urbs
capabilities, restricted operations and, Capabilities and Restricted Operations
capability.h header file, Capabilities and Restricted Operations, Quick Reference
capable function, Capabilities and Restricted Operations, Quick Reference
CAP_DAC_OVERRIDE capability, Capabilities and Restricted Operations, Restricting Access to a Single User at a Time
single-user access to devices, Restricting Access to a Single User at a Time
CAP_NET_ADMIN capability, Capabilities and Restricted Operations
CAP_SYS_ADMIN capability, Capabilities and Restricted Operations
CAP_SYS_MODULE capability, Capabilities and Restricted Operations
CAP_SYS_RAWIO capability, Capabilities and Restricted Operations
CAP_SYS_TTY_CONFIG capability, Capabilities and Restricted Operations
card select number (CSN), The Plug-and-Play Specification
cardctl utility, The Role of the Device Driver
carrier signals, Changes in Link State
cdev structure, Char Device Registration
change_bit operation, Bit Operations
change_mtu method, The Device Methods, Packet Reception
improving performance using socket buffers, Packet Reception
char (character) drivers, Classes of Devices and Modules, The Design of scull, The Design of scull, Major and Minor Numbers, Major and Minor Numbers–Dynamic Allocation of Major Numbers, File Operations–File Operations, The file Structure, The inode Structure, Char Device Registration–The Older Way, The open Method–The open Method, The release Method, scull's Memory Usage–scull's Memory Usage, read and write, read and write, readv and writev, readv and writev, Playing with the New Devices, ioctl–Device Control Without ioctl, Blocking I/O–Testing the Scullpipe Driver, poll and select–The Underlying Data Structure, poll and select–The Underlying Data Structure, Asynchronous Notification–The Driver's Point of View, Seeking a Device, Access Control on a Device File
access, Major and Minor Numbers–Dynamic Allocation of Major Numbers
asynchronous notification, Asynchronous Notification–The Driver's Point of View
defining mechanism of, The Design of scull
files, File Operations–File Operations, The file Structure, Access Control on a Device File
access to, Access Control on a Device File
operations, File Operations–File Operations
structures, The file Structure
I/O, Blocking I/O–Testing the Scullpipe Driver
inode structure, The inode Structure
ioctl method, ioctl–Device Control Without ioctl
llseek method, Seeking a Device
memory usage (scull), scull's Memory Usage–scull's Memory Usage
open method, The open Method–The open Method
poll method, poll and select–The Underlying Data Structure
read method, read and write
readv calls, readv and writev
registration, Char Device Registration–The Older Way
release method, The release Method
scull (design of), The Design of scull
select method, poll and select–The Underlying Data Structure
testing, Playing with the New Devices
version numbers, Major and Minor Numbers
write method, read and write
writev calls, readv and writev
char *buffer field (request structure), A Simple request Method
char *name variable (USB), probe and disconnect in Detail
char bus_id field, Devices
char disk_name field (gendisk), The gendisk structure
char name field (net_device structure), Global Information
character drivers, Classes of Devices and Modules (see char drivers)
chars_in_buffer function, Other Buffering Functions
CHECKSUM_ symbols, Packet Reception
check_flags method, File Operations
circular buffers, How Messages Get Logged, Lock-Free Algorithms, Implementing a Handler, Overview of a DMA Data Transfer
DMA ring buffers, Overview of a DMA Data Transfer
implementing interrupt handlers, Implementing a Handler
for printk function, How Messages Get Logged
claim_dma_lock function, Talking to the DMA controller
class register (PCI), Configuration Registers and Initialization
classes, Classes of Devices and Modules–Classes of Devices and Modules, Classes of Devices and Modules, The Linux Device Model, Classes–Class interfaces, Managing classes, Class devices, Class interfaces, Classes
devices, Classes of Devices and Modules, The Linux Device Model, Class devices
functions, Classes
interfaces, Class interfaces
Linux device model, Classes–Class interfaces
management, Managing classes
modules, Classes of Devices and Modules–Classes of Devices and Modules
class_id field, Class devices
class_simple interface, The class_simple Interface
class_simple_create function, udev
class_simple_device_add function, udev
class_simple_device_remove function, udev
cleanup function, The Cleanup Function
clearing bits on interface boards, Implementing a Handler
clear_bit operation, Bit Operations
clear_dma_ff function, Talking to the DMA controller
clocks, Processor-Specific Registers, Timekeeping, Timekeeping
(see also time)
cycles (counting), Processor-Specific Registers
cloning devices, Cloning the Device on open
close function (tty drivers), open and close–open and close
close method, The release Method, The vm_area_struct structure
vm_operations_struct structure, The vm_area_struct structure
cmd field (request structure), Command Pre-Preparation
coarse-grained locking, Fine- Versus Coarse-Grained Locking
code, Loadable Modules, Setting Up Your Test System, The Hello World Module–The Hello World Module, User Space and Kernel Space, Concurrency in the Kernel, Preliminaries, Doing It in User Space–Doing It in User Space, Debugging by Printing, Pitfalls in scull, Manual sleeps, Restricting Access to a Single User at a Time, Processor-Specific Registers, Delaying Execution–Short Delays, Short Delays, Delays, ISA Programming
concurrency in, Concurrency in the Kernel
delaying execution of, Short Delays
execution, Delaying Execution–Short Delays, Delays
hello world module, The Hello World Module–The Hello World Module
inline assembly (example), Processor-Specific Registers
ISA, ISA Programming
kernels, Debugging by Printing (see kernels)
memory (scull), Pitfalls in scull
module requirements, Preliminaries
runtime, Loadable Modules
sculluid, Restricting Access to a Single User at a Time
sleeps, Manual sleeps
test system setup, Setting Up Your Test System
user space programming, User Space and Kernel Space, Doing It in User Space–Doing It in User Space
coherency, DMA mappings, Setting up coherent DMA mappings
caches, DMA mappings
DMA, Setting up coherent DMA mappings
command pre-preparation (block drivers), Command Pre-Preparation
command-oriented drivers, Device Control Without ioctl
commands, printk, printk, Debugging by Watching, Debugging by Watching, Using gdb, Choosing the ioctl Commands, The Predefined Commands, The Predefined Commands, The Predefined Commands, The Predefined Commands, The Predefined Commands, The Predefined Commands, The Implementation of the ioctl Commands, Asynchronous Notification, Asynchronous Notification, Asynchronous Notification, Quick Reference, Hardware Information, Opening and Closing–Opening and Closing, Custom ioctl Commands, Custom ioctl Commands
(see also functions)
dmesg, printk
FIOASYNC, The Predefined Commands
FIOCLEX, The Predefined Commands
FIONBIO, The Predefined Commands
FIONCLEX, The Predefined Commands
FIOQSIZE, The Predefined Commands
F_SETFL fcntl, Asynchronous Notification
F_SETOWN, Asynchronous Notification
gdb, Using gdb
ifconfig, Hardware Information, Opening and Closing–Opening and Closing
net_device structure and, Hardware Information
opening network drivers, Opening and Closing–Opening and Closing
ioctl, Choosing the ioctl Commands, The Predefined Commands, The Implementation of the ioctl Commands, Quick Reference, Custom ioctl Commands
creating, Quick Reference
customizing for networking, Custom ioctl Commands
implementation, The Implementation of the ioctl Commands
printk, printk (see printk function)
SIOCDEVPRIVATE, Custom ioctl Commands
strace, Debugging by Watching
wc, Debugging by Watching
communication with user space, The Linux Device Model
compilers, Processor-Specific Registers, I/O Registers and Conventional Memory
gcc, Processor-Specific Registers
optimizations, I/O Registers and Conventional Memory
compiling, Compiling Modules–Compiling Modules, Playing with the New Devices
char drivers, Playing with the New Devices
modules, Compiling Modules–Compiling Modules
complete function (urbs), Completing Urbs: The Completion Callback Handler
complete module, Completions
completion, Completions–Completions, Completing Urbs: The Completion Callback Handler, Talking to the DMA controller, Request Completion Functions
of DMA, Talking to the DMA controller
request functions, Request Completion Functions
semaphores, Completions–Completions
urbs, Completing Urbs: The Completion Callback Handler
concurrency, Concurrency in the Kernel, Concurrency in the Kernel, Pitfalls in scull, Concurrency and Its Management–Concurrency and Its Management, Semaphores and Mutexes, The Linux Semaphore Implementation–Reader/Writer Semaphores, Completions–Completions, Spinlocks–Reader/Writer Spinlocks, Locking Traps–Fine- Versus Coarse-Grained Locking, Alternatives to Locking–Read-Copy-Update, Controlling Transmission Concurrency, Controlling Transmission Concurrency
alternatives to locking, Alternatives to Locking–Read-Copy-Update
controlling transmission, Controlling Transmission Concurrency
debugging, Concurrency in the Kernel
in kernel programming, Concurrency in the Kernel
locking, Semaphores and Mutexes, Locking Traps–Fine- Versus Coarse-Grained Locking
adding, Semaphores and Mutexes
traps, Locking Traps–Fine- Versus Coarse-Grained Locking
management, Concurrency and Its Management–Concurrency and Its Management
scull (troubleshooting memory), Pitfalls in scull
semaphores, The Linux Semaphore Implementation–Reader/Writer Semaphores, Completions–Completions
completion, Completions–Completions
implementation, The Linux Semaphore Implementation–Reader/Writer Semaphores
spinlocks, Spinlocks–Reader/Writer Spinlocks
transmission, Controlling Transmission Concurrency
configuration, Setting Up Your Test System, Version Dependency, Module Parameters–Module Parameters, Major and Minor Numbers, The Internal Representation of Device Numbers, Allocating and Freeing Device Numbers, Allocating and Freeing Device Numbers, Dynamic Allocation of Major Numbers, Char Device Registration, Debugging Support in the Kernel–Debugging Support in the Kernel, Semaphores and Mutexes, Timeouts, Installing an Interrupt Handler–The internals of interrupt handling on the x86, PCI Addressing, Configuration Registers and Initialization, Accessing the Configuration Space, Configurations, Setting up coherent DMA mappings, Setting up streaming DMA mappings, Single-page streaming mappings, How snull Is Designed–The Physical Transport of Packets, Device Registration, Interface Information–The Device Methods, The Device Methods, A Typical Implementation, TTY Line Settings, ioctls
cdev structure, Char Device Registration
char drivers, Allocating and Freeing Device Numbers
character (char) drivers, Major and Minor Numbers, The Internal Representation of Device Numbers, Allocating and Freeing Device Numbers, Dynamic Allocation of Major Numbers
(see also char drivers)
dynamic allocation of major numbers, Dynamic Allocation of Major Numbers
internal representation of device numbers, The Internal Representation of Device Numbers
major/minor numbers, Major and Minor Numbers
coherent DMA mappings, Setting up coherent DMA mappings
critical sections, Semaphores and Mutexes
ether_setup function, Interface Information–The Device Methods
interrupt handlers, Installing an Interrupt Handler–The internals of interrupt handling on the x86
kernels, Debugging Support in the Kernel–Debugging Support in the Kernel
line settings (tty drivers), TTY Line Settings
multicasting, A Typical Implementation
network devices, The Device Methods
net_device structure, Device Registration
parameter assignment, Module Parameters–Module Parameters
PCI, PCI Addressing, Configuration Registers and Initialization, Accessing the Configuration Space
accessing configuration space, Accessing the Configuration Space
registers, Configuration Registers and Initialization
serial lines, ioctls
single-page streaming mappings, Single-page streaming mappings
snull drivers, How snull Is Designed–The Physical Transport of Packets
streaming DMA mappings, Setting up streaming DMA mappings
test system setup, Setting Up Your Test System
timeouts, Timeouts
USB interfaces, Configurations
version dependency, Version Dependency
CONFIG_ACPI_DEBUG option, Debugging Support in the Kernel
CONFIG_DEBUG_DRIVER option, Debugging Support in the Kernel
CONFIG_DEBUG_INFO option, Debugging Support in the Kernel
CONFIG_DEBUG_KERNEL option, Debugging Support in the Kernel
CONFIG_DEBUG_PAGEALLOC option, Debugging Support in the Kernel
CONFIG_DEBUG_SLAB option, Debugging Support in the Kernel
CONFIG_DEBUG_SPINLOCK option, Debugging Support in the Kernel
CONFIG_DEBUG_SPINLOCK_SLEEP option, Debugging Support in the Kernel
CONFIG_DEBUG_STACKOVERFLOW option, Debugging Support in the Kernel
CONFIG_DEBUG_STACK_USAGE option, Debugging Support in the Kernel
CONFIG_IKCONFIG option, Debugging Support in the Kernel
CONFIG_IKCONFIG_PROC option, Debugging Support in the Kernel
CONFIG_INIT_DEBUG option, Debugging Support in the Kernel
CONFIG_INPUT_EVBUG option, Debugging Support in the Kernel
CONFIG_KALLSYMS option, Debugging Support in the Kernel
CONFIG_MAGIC_SYSRQ option, Debugging Support in the Kernel
CONFIG_PROFILING option, Debugging Support in the Kernel
CONFIG_SCSI_CONSTANTS option, Debugging Support in the Kernel
CONFIG_USB_DYNAMIC_MINORS configuration option, probe and disconnect in Detail
connections, Creating your /proc file, PCI Drivers, USB Drivers, IEEE1394 (FireWire), USB, Assigning IP Numbers, Connecting to the Kernel–Utility Fields
(see also hotplugs)
Firewire, IEEE1394 (FireWire)
IP numbers, Assigning IP Numbers
network drivers to kernels, Connecting to the Kernel–Utility Fields
PCI, PCI Drivers (see PCI)
/proc file hierarchies, Creating your /proc file
USB, USB Drivers (see USB)
connectors (ISA), VLB
consoles, Redirecting Console Messages, Device Control Without ioctl
messages (redirecting), Redirecting Console Messages
wrong font on, Device Control Without ioctl
console_loglevel variable, printk, System Hangs
debugging system hangs, System Hangs
const char *dev_name functions, Installing an Interrupt Handler
const char *name field (PCI registration), Registering a PCI Driver
const char *name function, Registering a USB Driver
const struct pci_device_id *id_table field (PCI registration), Registering a PCI Driver
const struct usb_device_id *id_table function, Registering a USB Driver
constructor function (kmem_cache_create), Lookaside Caches
CONTROL endpoints (USB), Endpoints
control functions (queues), Queue control functions
control urbs (USB), Control urbs
controllers (PCI), Hardware Abstractions
controlling, Device Control Without ioctl, Submitting and Controlling a Urb, Controlling Transmission Concurrency
transmission concurrency, Controlling Transmission Concurrency
urbs (USB), Submitting and Controlling a Urb
by writing control sequences, Device Control Without ioctl
conventional memory I/O registers, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory
(see also memory)
conversion (virtual addresses), Bus Addresses
copying (cross-space), read and write
core files, Using gdb
counters, Measuring Time Lapses, Processor-Specific Registers, Processor-Specific Registers, Reference count manipulation
jiffies, Measuring Time Lapses
reference (kobjects), Reference count manipulation
registers, Processor-Specific Registers
TSC, Processor-Specific Registers
counts (interrupts), ioctls
CPU modalities (levels), User Space and Kernel Space
create_module system call, vmalloc and Friends
create_proc_read_entry function, Creating your /proc file
creating, Creating and Destroying Urbs, Queue creation and deletion
queues, Queue creation and deletion
urbs (USB), Creating and Destroying Urbs
critical sections, Semaphores and Mutexes
cross-space copying, read and write
CRTSCTS bitmask, set_termios
CSIZE bitmask, set_termios
CSN (card select number), The Plug-and-Play Specification
CSTOPB bitmask, set_termios
current process, The Current Process, The Current Process, Quick Reference
current time, retrieving, Knowing the Current Time–Knowing the Current Time
current.h header file, The Current Process
currentime file (jit module), Knowing the Current Time
custom, Interface-Specific Types, Custom ioctl Commands
data types, Interface-Specific Types
ioctl methods for networking, Custom ioctl Commands
cycles_t type, Processor-Specific Registers

D

daemons, The Hello World Module, printk, How Messages Get Logged
klogd, The Hello World Module, printk
syslogd, How Messages Get Logged
data, Assigning an Explicit Size to Data Items, Data Alignment, Direct Memory Access–Talking to the DMA controller, The Physical Transport of Packets
explicitly sizing, Assigning an Explicit Size to Data Items
physical packet transport, The Physical Transport of Packets
transferring with DMA, Direct Memory Access–Talking to the DMA controller
unaligned, portability and, Data Alignment
data attribute (firmware), How It Works
data functions (USB), Other USB Data Functions
data structures, Some Important Data Structures, File Operations–File Operations, Data Alignment
file operations, File Operations–File Operations
portability of, Data Alignment
data types, Use of Standard C Types, Use of Standard C Types, Use of Standard C Types, Use of Standard C Types, Assigning an Explicit Size to Data Items, Assigning an Explicit Size to Data Items, Assigning an Explicit Size to Data Items, Interface-Specific Types, Interface-Specific Types
for explicitly sizing data, Assigning an Explicit Size to Data Items
inptr_t (C99 standard), Use of Standard C Types
int, Use of Standard C Types
interface-specific, Interface-Specific Types
loose typing for I/O functions, Interface-Specific Types
mixing different, Use of Standard C Types
standard C types, Use of Standard C Types
u8, u16, u32, u64, Assigning an Explicit Size to Data Items
uint8_t/unit32_t, Assigning an Explicit Size to Data Items
dataalign program, Data Alignment
datasize program, Use of Standard C Types
dd utility and scull driver example, scull's Memory Usage
deadline schedulers (I/O), Request Queues
deadlocks, avoiding, Spinlocks, Lock Ordering Rules
(see also locking)
debugging, Concurrency in the Kernel, Debugging Support in the Kernel–Debugging Support in the Kernel, Debugging Support in the Kernel, Debugging by Printing–Printing Device Numbers, Turning the Messages On and Off, Turning the Messages On and Off, Debugging by Querying–The ioctl Method, The ioctl Method, The ioctl Method, Debugging by Watching, Debugging System Faults, System Hangs, System Hangs, Debuggers and Related Tools, The kgdb Patches, The User-Mode Linux Port, The Linux Trace Toolkit, Dynamic Probes, Handler Arguments and Return Value
(see also troubleshooting)
concurrency, Concurrency in the Kernel
using a debugger, Debuggers and Related Tools
using Dynamic Probes, Dynamic Probes
interrupt handlers, Handler Arguments and Return Value
with ioctl method, The ioctl Method
kernels, Debugging Support in the Kernel–Debugging Support in the Kernel, Debugging by Printing–Printing Device Numbers, Debugging by Querying–The ioctl Method, Debugging by Watching
monitoring, Debugging by Watching
by printing, Debugging by Printing–Printing Device Numbers
by querying, Debugging by Querying–The ioctl Method
support, Debugging Support in the Kernel–Debugging Support in the Kernel
using kgdb, The kgdb Patches
levels (implementation of), Turning the Messages On and Off
using LTT, The Linux Trace Toolkit
locked keyboard, System Hangs
by printing, Turning the Messages On and Off
by querying, The ioctl Method
system faults, Debugging System Faults
system hangs, System Hangs
using User-Mode Linux, The User-Mode Linux Port
declaration of array parameters, Module Parameters
DECLARE_TASKLET macro, Tasklets
default attributes (kobjects), Default Attributes
default_attrs field (kobjects), Default Attributes
DEFAULT_CONSOLE_LOGLEVEL, printk
DEFAULT_MESSAGE_LOGLEVEL, printk
delaying execution of code, Delaying Execution–Short Delays, Delays
deleting, Creating your /proc file, Nondefault Attributes, Symbolic Links, Bus attributes, Remove a Device, Remove a Driver, Setting up streaming DMA mappings, Queue creation and deletion
attributes, Nondefault Attributes, Bus attributes
devices, Remove a Device
drivers, Remove a Driver
mappings (DMA), Setting up streaming DMA mappings
/proc files, Creating your /proc file
queues, Queue creation and deletion
symbolic links, Symbolic Links
del_timer_sync function, The Timer API
dentry field (file structure), The file Structure
dependency, Version Dependency, Platform Dependency
platform, Platform Dependency
version, Version Dependency
dereferencing memory addresses, Use of Standard C Types
descriptors (USB), Other USB Data Functions
design, The Role of the Device Driver, The Design of scull, The Design of scull, Concurrency and Its Management–Concurrency and Its Management
(see also configuration)
concurrency, Concurrency and Its Management–Concurrency and Its Management
policy-free drivers, The Role of the Device Driver
of scull, The Design of scull
desktops, PCI Drivers, USB Drivers
PCI, PCI Drivers (see PCI)
USB, USB Drivers (see USB)
destroying urbs (USB), Creating and Destroying Urbs
destructor function (kmem_cache_create), Lookaside Caches
/dev directory, Major and Minor Numbers
/dev nodes, Classes of Devices and Modules, Major and Minor Numbers, Dynamic Allocation of Major Numbers, Installing an Interrupt Handler, Installing an Interrupt Handler
char devices and, Major and Minor Numbers
/dev/random device, Installing an Interrupt Handler
/dev/urandom device, Installing an Interrupt Handler
dynamic major number allocation, Dynamic Allocation of Major Numbers
/dev tree, udev
development community (kernel), joining, Joining the Kernel Development Community
development kernels, Version Numbering
device attribute (firmware), How It Works
DEVICE variable, USB
deviceID register (PCI), Configuration Registers and Initialization
devices、分割内核scull 的设计主次编号主次编号设备编号的内部表示、分配和释放设备编号分配和释放设备编号动态分配主编号动态分配主编号数字文件操作旧方法打开方法读写打印设备编号并发及其管理并发及其管理ioctl没有 ioctl 的设备控制没有 ioctl 的设备控制阻塞 I/O 示例测试 Scullpipe 驱动程序从设备读取数据写入设备查找设备设备文件的访问控制,单开设备,单开设备,单开设备,在开放上克隆设备,硬件资源, USB 驱动程序, Linux 设备模型Linux 设备模型 Linux 设备模型 Linux设备模型Kobject、Kset 和子系统子系统低级 Sysfs 操作符号链接热插拔事件生成、总线总线属性迭代设备和驱动程序、设备驱动程序结构嵌入设备注册设备属性设备结构嵌入设备驱动程序类接口、、综合类设备-删除驱动程序添加设备删除设备热插拔- udev动态设备网络输入SCSI处理固件-工作原理总线、设备和驱动程序、使用remap_pfn_range ,网络驱动程序,设备注册,初始化每个设备,设备方法,设备方法
devices, Splitting the Kernel, The Design of scull, Major and Minor Numbers, Major and Minor Numbers, The Internal Representation of Device Numbers, Allocating and Freeing Device Numbers, Allocating and Freeing Device Numbers, Dynamic Allocation of Major Numbers, Dynamic Allocation of Major Numbers, File Operations, The Older Way, The open Method, read and write, Printing Device Numbers, Concurrency and Its ManagementConcurrency and Its Management, ioctlDevice Control Without ioctl, Device Control Without ioctl, A Blocking I/O ExampleTesting the Scullpipe Driver, Reading data from the device, Writing to the device, Seeking a Device, Access Control on a Device File, Single-Open Devices, Single-Open Devices, Single-Open Devices, Cloning the Device on open, Hardware Resources, USB Drivers, The Linux Device ModelThe Linux Device Model, The Linux Device Model, The Linux Device Model, Kobjects, Ksets, and SubsystemsSubsystems, Low-Level Sysfs OperationsSymbolic Links, Hotplug Event Generation, BusesBus attributes, Iterating over devices and drivers, DevicesDriver structure embedding, Device registration, Device attributes, Device structure embedding, Device Drivers, ClassesClass interfaces, Class devices, Putting It All TogetherRemove a Driver, Add a Device, Remove a Device, Hotplugudev, Dynamic Devices, Networking, Input, SCSI, Dealing with FirmwareHow It Works, Buses, Devices, and Drivers, Using remap_pfn_range, Network Drivers, Device Registration, Initializing Each Device, The Device Methods, The Device Methods
(另请参阅驱动程序)
文件访问、设备文件访问控制
添加,添加设备
号码分配、分配和释放设备号码
属性、设备属性
缓存问题,使用 remap_pfn_range
字符驱动程序,旧方法(请参阅字符驱动程序)
类别,Linux 设备模型,类设备
克隆,在打开时克隆设备
并发、并发及其管理–并发及其管理
控制操作,分割内核
删除、移除设备
驱动程序、设备驱动程序
动态、动态设备
动态分配主编号,动态分配主编号
FIFO,scull的设计
文件操作,文件操作
文件、主要编号和次要编号
函数、总线、设备和驱动程序
可热插拔,Linux 设备模型
使用 ls 命令识别类型、主要编号和次要编号
初始化,初始化每个设备
输入(热插拔),输入
数字的内部表示,设备编号的内部表示
ioctl 方法,ioctl–不使用 ioctl 的设备控制
ISA、硬件资源
迭代,迭代设备和驱动程序
Linux 设备模型、Linux 设备模型–Linux 设备模型,Kobject、Kset 和子系统–子系统,低级 Sysfs 操作–符号链接,热插拔事件生成,总线–总线属性,设备–驱动程序结构嵌入,类–类接口,将它们放在一起–删除驱动程序,热插拔–udev,处理固件–工作原理
总线、总线–总线属性
类,类–类接口
固件,处理固件–工作原理
热插拔事件,热插拔事件生成
热插拔,热插拔–udev
kobjects、Kobjects、Ksets 和子系统–子系统
生命周期,将它们放在一起–删除驱动程序
低级 sysfs 操作,低级 Sysfs 操作–符号链接
方法,设备方法
名称,动态分配主编号
网络、网络
网络驱动程序,网络驱动程序
数字(打印),打印设备编号
操作,设备方法
读和写,读和写
读取数据,从设备读取数据
注册、设备注册,设备注册
SCSI、SCSI
scullpipe(示例),阻塞 I/O 示例–测试 Scullpipe 驱动程序
scullsingle、单开设备
寻找,寻找设备
单开、单开设备,单开设备
结构(嵌入),设备结构嵌入
打开时截断,open 方法
USB、USB 驱动程序(请参阅 USB)
写入,不使用 ioctl 的设备控制,写入设备
控制序列,不带 ioctl 的设备控制
数据到,写入设备
(see also drivers)
access to files, Access Control on a Device File
adding, Add a Device
allocation of numbers, Allocating and Freeing Device Numbers
attributes, Device attributes
caching problems, Using remap_pfn_range
char drivers, The Older Way (see char drivers)
classes of, The Linux Device Model, Class devices
cloning, Cloning the Device on open
concurrency, Concurrency and Its Management–Concurrency and Its Management
control operations, Splitting the Kernel
deleting, Remove a Device
drivers, Device Drivers
dynamic, Dynamic Devices
dynamic allocation of major numbers, Dynamic Allocation of Major Numbers
FIFO, The Design of scull
file operations on, File Operations
files, Major and Minor Numbers
functions, Buses, Devices, and Drivers
hotpluggable, The Linux Device Model
identifying type with ls command, Major and Minor Numbers
initialization, Initializing Each Device
input (hotplugging), Input
internal representation of numbers, The Internal Representation of Device Numbers
ioctl method, ioctl–Device Control Without ioctl
ISA, Hardware Resources
iteration, Iterating over devices and drivers
Linux device model, The Linux Device Model–The Linux Device Model, Kobjects, Ksets, and Subsystems–Subsystems, Low-Level Sysfs Operations–Symbolic Links, Hotplug Event Generation, Buses–Bus attributes, Devices–Driver structure embedding, Classes–Class interfaces, Putting It All Together–Remove a Driver, Hotplug–udev, Dealing with Firmware–How It Works
buses, Buses–Bus attributes
classes, Classes–Class interfaces
firmware, Dealing with Firmware–How It Works
hotplug events, Hotplug Event Generation
hotplugging, Hotplug–udev
kobjects, Kobjects, Ksets, and Subsystems–Subsystems
lifecycles, Putting It All Together–Remove a Driver
low-level sysfs operations, Low-Level Sysfs Operations–Symbolic Links
methods, The Device Methods
names of, Dynamic Allocation of Major Numbers
network, Networking
network drivers, Network Drivers
numbers (printing), Printing Device Numbers
operations, The Device Methods
reading and writing, read and write
reading data from, Reading data from the device
registration, Device registration, Device Registration
SCSI, SCSI
scullpipe (example), A Blocking I/O Example–Testing the Scullpipe Driver
scullsingle, Single-Open Devices
seeking, Seeking a Device
single-open, Single-Open Devices, Single-Open Devices
structures (embedding), Device structure embedding
truncating on open, The open Method
USB, USB Drivers (see USB)
writing, Device Control Without ioctl, Writing to the device
control sequences to, Device Control Without ioctl
data to, Writing to the device
DEVPATH 变量,/sbin/hotplug 实用程序
DEVPATH variable, The /sbin/hotplug Utility
dev_alloc_skb 函数,作用于套接字缓冲区的函数
dev_alloc_skb function, Functions Acting on Socket Buffers
dev_id 指针(安装共享处理程序),安装共享处理程序
dev_id pointer (installing shared handlers), Installing a Shared Handler
dev_kfree_skb 函数,中断处理程序作用于套接字缓冲区的函数
dev_kfree_skb function, The Interrupt Handler, Functions Acting on Socket Buffers
dev_mc_list 结构,内核对多播的支持
dev_mc_list structure, Kernel Support for Multicasting
dev_t i_rdev(inode结构字段),inode结构
dev_t i_rdev (inode structure field), The inode Structure
直接 I/O、执行直接 I/O–异步 I/O 示例,执行直接 I/O,实现直接 I/O
direct I/O, Performing Direct I/O–An asynchronous I/O example, Performing Direct I/O, Implementing Direct I/O
(另请参见 I/O)
实施,实施直接 I/O
(see also I/O)
implementation, Implementing Direct I/O
直接内存访问、内存映射和 DMA(请参阅 DMA)
direct memory access, Memory Mapping and DMA (see DMA)
目录、内核模块与应用程序,主要和次要编号,文件结构,重定向控制台消息,创建 /proc 文件,测试 Scullpipe 驱动程序,USB 和 Sysfs–USB 和 Sysfs,低级 Sysfs 操作–符号链接,TTY 驱动程序,struct termios,TTY 设备的 proc 和 sysfs 处理
directories, Kernel Modules Versus Applications, Major and Minor Numbers, The file Structure, Redirecting Console Messages, Creating your /proc file, Testing the Scullpipe Driver, USB and Sysfs–USB and Sysfs, Low-Level Sysfs Operations–Symbolic Links, TTY Drivers, struct termios, proc and sysfs Handling of TTY Devices
/dev,主要和次要数字
条目(文件结构),文件结构
内核头文件、内核模块与应用程序
misc-progs 源、重定向控制台消息,测试 Scullpipe 驱动程序
/proc 文件层次结构连接,创建 /proc 文件
/proc/tty/driver, TTY 驱动程序
sysfs、USB 和 Sysfs–USB 和 Sysfs,低级 Sysfs 操作–符号链接,struct termios
低级操作,低级 Sysfs 操作–符号链接
tty 驱动程序,结构 termios
USB、USB 和 Sysfs–USB 和 Sysfs
tty 驱动程序、proc 和 sysfs TTY 设备的处理
/dev, Major and Minor Numbers
entries (file structure), The file Structure
of kernel headers, Kernel Modules Versus Applications
misc-progs source, Redirecting Console Messages, Testing the Scullpipe Driver
/proc file hierarchy connections, Creating your /proc file
/proc/tty/driver, TTY Drivers
sysfs, USB and Sysfs–USB and Sysfs, Low-Level Sysfs Operations–Symbolic Links, struct termios
low-level operations, Low-Level Sysfs Operations–Symbolic Links
tty driver, struct termios
USB, USB and Sysfs–USB and Sysfs
tty drivers, proc and sysfs Handling of TTY Devices
*dir_notify方法,文件操作
*dir_notify method, File Operations
disable_dma 函数,与 DMA 控制器对话
disable_dma function, Talking to the DMA controller
disable_irq 函数,安装共享处理程序
disable_irq function, Installing a Shared Handler
禁用、打开和关闭消息,启用和禁用中断,控制传输并发
disabling, Turning the Messages On and Off, Enabling and Disabling Interrupts, Controlling Transmission Concurrency
中断处理程序,启用和禁用中断
数据包传输,控制传输并发
打印语句,打开和关闭消息
interrupt handlers, Enabling and Disabling Interrupts
packet transmissions, Controlling Transmission Concurrency
print statements, Turning the Messages On and Off
数据泄露、安全问题
disclosure of data, Security Issues
断开连接函数(USB),注册 USB 驱动程序,详细探测和断开连接
disconnect function (USB), Registering a USB Driver, probe and disconnect in Detail
磁盘、文件结构,磁盘注册,gendisk 结构
disks, The file Structure, Disk Registration, The gendisk structure
文件与打开的文件,文件结构
释放,gendisk 结构
注册、磁盘注册
files versus open files, The file Structure
freeing, The gendisk structure
registration, Disk Registration
分发、编写驱动程序、平台依赖性
distribution, writing drivers for, Platform Dependency
DMA(直接内存访问)、直接内存访问,直接内存访问,DMA 数据传输概述,分散/聚集映射,简单的 PCI DMA 示例,注册 DMA 使用,直接内存访问,块请求和 DMA
DMA (direct memory access), Direct Memory Access, Direct Memory Access, Overview of a DMA Data Transfer, Scatter/gather mappings, A simple PCI DMA example, Registering DMA usage, Direct Memory Access, Block requests and DMA
块请求和,块请求和 DMA
映射(分散-聚集)、分散/聚集映射
PCI 设备和,一个简单的 PCI DMA 示例
注册使用情况、注册 DMA 使用情况
环形缓冲区,DMA 数据传输概述
block requests and, Block requests and DMA
mappings (scatter-gather), Scatter/gather mappings
PCI devices and, A simple PCI DMA example
registering usage, Registering DMA usage
ring buffers, Overview of a DMA Data Transfer
支持 DMA 的内存区域、内存区域,后备缓存
DMA-capable memory zone, Memory zones, Lookaside Caches
SLAB_CACHE_DMA 标志和,后备缓存
SLAB_CACHE_DMA flag and, Lookaside Caches
dma.h 头文件,注册 DMA 使用
dma.h header file, Registering DMA usage
DMAC(DMA 控制器),ISA 设备的 DMA
DMAC (DMA controller), DMA for ISA Devices
dma_addr_t setup_dma 字段(USB),结构 urb
dma_addr_t setup_dma field (USB), struct urb
dma_addr_t 传输_dma 字段(USB),结构 urb
dma_addr_t transfer_dma field (USB), struct urb
DMA_BIDIRECTIONAL 符号,设置流 DMA 映射,直接内存访问
DMA_BIDIRECTIONAL symbol, Setting up streaming DMA mappings, Direct Memory Access
dma_free_coherent 函数,设置相干 DMA 映射
dma_free_coherent function, Setting up coherent DMA mappings
DMA_FROM_DEVICE 符号、设置流式 DMA 映射,设置流式 DMA 映射,直接内存访问
DMA_FROM_DEVICE symbol, Setting up streaming DMA mappings, Setting up streaming DMA mappings, Direct Memory Access
DMA_NONE 符号,设置流 DMA 映射,直接内存访问
DMA_NONE symbol, Setting up streaming DMA mappings, Direct Memory Access
dma_spin_lock,与 DMA 控制器对话
dma_spin_lock, Talking to the DMA controller
DMA_TO_DEVICE 符号,设置流 DMA 映射,直接内存访问
DMA_TO_DEVICE symbol, Setting up streaming DMA mappings, Direct Memory Access
dmesg 命令、printk
dmesg command, printk
自己动手探索,自己动手探索
do-it-yourself probing, Do-it-yourself probing
双下划线 (__) 函数,一些其他细节
double underscore (__) functions, A Few Other Details
双地址周期映射 (PCI), PCI 双地址周期映射
double-address cycle mappings (PCI), PCI double-address cycle mappings
双向链表(可移植性)、链表,快速参考
doubly linked lists (portability), Linked Lists, Quick Reference
down 函数,Linux 信号量实现
down function, The Linux Semaphore Implementation
do_close函数,打开和关闭
do_close function, open and close
do_gettimeofday 函数,了解当前时间
do_gettimeofday function, Knowing the Current Time
do_ioctl 方法、设备方法,自定义 ioctl 命令
do_ioctl method, The Device Methods, Custom ioctl Commands
do_IRQ 函数,x86 上中断处理的内部结构
do_IRQ function, The internals of interrupt handling on the x86
drivers,设备驱动程序的角色,设备驱动程序的角色-设备驱动程序的角色,设备和模块的类,设备和模块的类,设备和模块的类,设备和模块的类,安全问题,在用户空间中进行操作scull 的设计文件操作并发性和竞争条件选择 ioctl 命令不使用 ioctl 进行设备控制驱动程序的观点基于 Slab 缓存的 scull: scullc使用整个页面的 scull : scullp使用虚拟地址的 scull : scullv快速参考示例驱动程序重用 I/O 内存缩写安装中断处理程序动手实践自己探测实现处理程序写缓冲示例写缓冲示例总线方法迭代设备和驱动程序设备驱动程序驱动程序结构嵌入驱动程序结构嵌入添加驱动程序删除驱动程序总线、设备和驱动程序块驱动程序sbull 中的初始化简单请求方法网络驱动程序snull 的设计方式数据包的物理传输连接到内核实用程序字段打开和关闭打开和关闭中断处理程序链路状态的更改MAC 地址解析非以太网标头自定义 ioctl 命令统计信息多播典型实现快速参考快速参考TTY 驱动程序小型 TTY 驱动程序struct termiosstruct termiostty_driver 函数指针未读取函数?,其他缓冲函数, TTY 线路设置, TTY 设备的 proc 和 sysfs 处理, tty_driver 结构详细信息, tty_operations 结构详细信息, tty_struct 结构详细信息,快速参考
drivers, The Role of the Device Driver, The Role of the Device DriverThe Role of the Device Driver, Classes of Devices and Modules, Classes of Devices and Modules, Classes of Devices and Modules, Classes of Devices and Modules, Security Issues, Doing It in User Space, The Design of scull, File Operations, Concurrency and Race Conditions, Choosing the ioctl Commands, Device Control Without ioctl, The Driver's Point of View, A scull Based on the Slab Caches: scullc, A scull Using Whole Pages: scullp, A scull Using Virtual Addresses: scullv, Quick Reference, A Sample Driver, Reusing short for I/O Memory, Installing an Interrupt Handler, Do-it-yourself probing, Implementing a Handler, A Write-Buffering ExampleA Write-Buffering Example, Bus methods, Iterating over devices and drivers, Device Drivers, Driver structure embedding, Driver structure embedding, Add a Driver, Remove a Driver, Buses, Devices, and Drivers, Block Drivers, Initialization in sbull, A Simple request Method, Network Drivers, How snull Is DesignedThe Physical Transport of Packets, Connecting to the KernelUtility Fields, Opening and ClosingOpening and Closing, The Interrupt Handler, Changes in Link State, MAC Address ResolutionNon-Ethernet Headers, Custom ioctl Commands, Statistical Information, MulticastA Typical Implementation, Quick ReferenceQuick Reference, TTY DriversA Small TTY Driver, struct termiosstruct termios, tty_driver Function PointersNo read Function?, Other Buffering Functions, TTY Line Settings, proc and sysfs Handling of TTY Devices, The tty_driver Structure in Detail, The tty_operations Structure in Detail, The tty_struct Structure in Detail, Quick Reference
添加,添加驱动程序
异步通知和,驱动程序的观点
属性、驱动结构嵌入
块,块驱动程序(请参阅块驱动程序)
面向命令,无需 ioctl 的设备控制
删除、删除驱动程序
设备、设备驱动程序
文件操作,文件操作
FireWire、设备和模块类别
函数、总线、设备和驱动程序
I2O,设备和模块类别
ioctl 编号,选择 ioctl 命令
迭代,迭代设备和驱动程序
lddbus,总线方法
机制、设备驱动程序的作用,设备驱动程序的作用–设备驱动程序的作用,scull 的设计
策略与,设备驱动程序的角色
与策略分离,设备驱动程序的角色–设备驱动程序的角色
模块、设备和模块类别
网络、网络驱动程序,snull 的设计方式–数据包的物理传输,连接到内核–实用程序字段,打开和关闭–打开和关闭,中断处理程序,链路状态的变化,MAC 地址解析–非以太网标头,自定义 ioctl 命令,统计信息,多播–典型实现,快速参考–快速参考
连接到内核,连接到内核–实用程序字段
功能,快速参考–快速参考
中断处理程序,中断处理程序
ioctl 命令、自定义 ioctl 命令
链接状态(变化),链接状态变化
MAC 地址(解析)、MAC 地址解析–非以太网标头
多播,多播–典型实现
打开,打开和关闭–打开和关闭
snull,snull 的设计方式–数据包的物理传输
统计、统计信息
sbull, sbull 中的初始化,一种简单的请求方法
初始化,在sbull中初始化
请求方法,一个简单的请求方法
SCSI,设备和模块类别
scull、并发和竞争条件(请参阅 scull)
scullc(示例),基于 Slab 缓存的 scull:scullc
scullp(示例),使用整个页面的 scull:scullp
scullv(示例),使用虚拟地址的 scull:scullv,快速参考
安全问题,安全问题
short(示例)、示例驱动程序,将 short 重用于 I/O 内存,安装中断处理程序,自己动手探测,实现处理程序
访问 I/O 内存,将 short 重用于 I/O 内存
实现中断处理程序,实现处理程序
安装中断处理程序,安装中断处理程序
探测,自己动手探测
shortprint,一个写缓冲示例–一个写缓冲示例
结构(嵌入)、驱动程序结构嵌入
tty、TTY 驱动程序–小型 TTY 驱动程序,struct termios–struct termios,tty_driver 函数指针–没有 read 函数?,其他缓冲函数,TTY 线路设置,TTY 设备的 proc 和 sysfs 处理,tty_driver 结构详细信息,tty_operations 结构详细信息,tty_struct 结构详细信息,快速参考
缓冲区、其他缓冲功能
目录,TTY 设备的 proc 和 sysfs 处理
功能,快速参考
线路设置、TTY 线路设置
指针、tty_driver 函数指针–没有 read 函数?
结构 termios,结构 termios–结构 termios
tty_driver 结构, tty_driver 结构详细信息
tty_operations 结构,tty_operations 结构详细信息
tty_struct 结构, tty_struct 结构详细信息
用户空间,在用户空间中执行
adding, Add a Driver
asynchronous notification and, The Driver's Point of View
attributes, Driver structure embedding
block, Block Drivers (see block drivers)
command-oriented, Device Control Without ioctl
deleting, Remove a Driver
devices, Device Drivers
file operations, File Operations
FireWire, Classes of Devices and Modules
functions, Buses, Devices, and Drivers
I2O, Classes of Devices and Modules
ioctl numbers for, Choosing the ioctl Commands
iteration, Iterating over devices and drivers
lddbus, Bus methods
mechanism, The Role of the Device Driver, The Role of the Device DriverThe Role of the Device Driver, The Design of scull
policy versus, The Role of the Device Driver
separation from policies, The Role of the Device DriverThe Role of the Device Driver
modules, Classes of Devices and Modules
network, Network Drivers, How snull Is Designed–The Physical Transport of Packets, Connecting to the Kernel–Utility Fields, Opening and Closing–Opening and Closing, The Interrupt Handler, Changes in Link State, MAC Address Resolution–Non-Ethernet Headers, Custom ioctl Commands, Statistical Information, Multicast–A Typical Implementation, Quick Reference–Quick Reference
connecting to kernels, Connecting to the Kernel–Utility Fields
functions, Quick Reference–Quick Reference
interrupt handlers for, The Interrupt Handler
ioctl commands, Custom ioctl Commands
link state (changes in), Changes in Link State
MAC addresses (resolution of), MAC Address Resolution–Non-Ethernet Headers
multicasting, Multicast–A Typical Implementation
opening, Opening and Closing–Opening and Closing
snull, How snull Is Designed–The Physical Transport of Packets
statistics, Statistical Information
sbull, Initialization in sbull, A Simple request Method
initialization, Initialization in sbull
request method, A Simple request Method
SCSI, Classes of Devices and Modules
scull, Concurrency and Race Conditions (see scull)
scullc (example), A scull Based on the Slab Caches: scullc
scullp (example), A scull Using Whole Pages: scullp
scullv (example), A scull Using Virtual Addresses: scullv, Quick Reference
security issues, Security Issues
short (example), A Sample Driver, Reusing short for I/O Memory, Installing an Interrupt Handler, Do-it-yourself probing, Implementing a Handler
accessing I/O memory, Reusing short for I/O Memory
implementing interrupt handlers, Implementing a Handler
installing interrupt handlers, Installing an Interrupt Handler
probing, Do-it-yourself probing
shortprint, A Write-Buffering Example–A Write-Buffering Example
structures (embedding), Driver structure embedding
tty, TTY Drivers–A Small TTY Driver, struct termios–struct termios, tty_driver Function Pointers–No read Function?, Other Buffering Functions, TTY Line Settings, proc and sysfs Handling of TTY Devices, The tty_driver Structure in Detail, The tty_operations Structure in Detail, The tty_struct Structure in Detail, Quick Reference
buffers, Other Buffering Functions
directories, proc and sysfs Handling of TTY Devices
functions, Quick Reference
line settings, TTY Line Settings
pointers, tty_driver Function Pointers–No read Function?
struct termios, struct termios–struct termios
tty_driver structure, The tty_driver Structure in Detail
tty_operations structure, The tty_operations Structure in Detail
tty_struct structure, The tty_struct Structure in Detail
user-space, Doing It in User Space
DRIVER_ATTR宏,驱动结构嵌入
DRIVER_ATTR macro, Driver structure embedding
driver_unregister函数,删除驱动程序
driver_unregister function, Remove a Driver
动态设备,动态设备
dynamic devices, Dynamic Devices
动态探针调试工具,动态探针
Dynamic Probes debugging tool, Dynamic Probes

E

EBUSY 错误,以阻塞 open 作为 EBUSY 的替代方案
EBUSY error, Blocking open as an Alternative to EBUSY
EISA(扩展ISA)、EISA
EISA (Extended ISA), EISA
电梯 (I/O)、请求队列
elevators (I/O), Request Queues
elv_next_request 函数,简单的请求方法,排队函数,命令预准备
elv_next_request function, A Simple request Method, Queueing functions, Command Pre-Preparation
嵌入、嵌入 kobjects,设备结构嵌入,驱动程序结构嵌入
embedding, Embedding kobjects, Device structure embedding, Driver structure embedding
设备结构,设备结构嵌入
驱动结构、驱动结构嵌入
kobjects,嵌入kobjects
device structures, Device structure embedding
driver structures, Driver structure embedding
kobjects, Embedding kobjects
enable_dma 函数,与 DMA 控制器对话
enable_dma function, Talking to the DMA controller
enable_irq 函数,安装共享处理程序
enable_irq function, Installing a Shared Handler
启用、内核中的调试支持–内核中的调试支持,启用和禁用中断,启用 PCI 设备
enabling, Debugging Support in the Kernel–Debugging Support in the Kernel, Enabling and Disabling Interrupts, Enabling the PCI Device
内核配置,内核中的调试支持–内核中的调试支持
中断处理程序,启用和禁用中断
PCI 驱动程序,启用 PCI 设备
configuration for kernels, Debugging Support in the Kernel–Debugging Support in the Kernel
interrupt handlers, Enabling and Disabling Interrupts
PCI drivers, Enabling the PCI Device
文件结尾、轮询和选择,llseek 实现
end-of-file, poll and select, The llseek Implementation
poll 方法 and,poll 和 select
相对于文件结尾查找,llseek 实现
poll method and, poll and select
seeking relative to, The llseek Implementation
无限循环(防止),系统挂起
endless loops, preventing, System Hangs
端点、端点,接口
endpoints, Endpoints, Interfaces
接口,接口
USB、端点
interfaces, Interfaces
USB, Endpoints
熵池和 SA_SAMPLE_RANDOM 标志,安装中断处理程序
entropy pool and SA_SAMPLE_RANDOM flag, Installing an Interrupt Handler
errno.h 头文件,初始化期间的错误处理
errno.h header file, Error Handling During Initialization
初始化期间的错误处理,初始化期间的错误处理
error handling during initialization, Error Handling During Initialization
错误、初始化期间的错误处理,初始化期间的错误处理,读取和写入,Oops 消息,指针和错误值
errors, Error Handling During Initialization, Error Handling During Initialization, read and write, Oops Messages, Pointers and Error Values
(另请参阅故障排除)
缓冲区溢出,Oops 消息
代码,初始化期间的错误处理
读/写,读和写
值(指针)、指针和错误值
(see also troubleshooting)
buffer overrun, Oops Messages
codes, Error Handling During Initialization
read/write, read and write
values (pointers), Pointers and Error Values
/etc/networks 文件,分配 IP 号
/etc/networks file, Assigning IP Numbers
/etc/syslog.conf 文件,如何记录消息
/etc/syslog.conf file, How Messages Get Logged
以太网、数据包的物理传输接口信息MAC 地址解析在以太网中使用 ARP在以太网中使用 ARP非以太网标头
Ethernet, The Physical Transport of Packets, Interface Information, MAC Address Resolution, Using ARP with Ethernet, Using ARP with Ethernet, Non-Ethernet Headers
地址解析、MAC地址解析
ARP 以及,将 ARP 与以太网结合使用,将 ARP 与以太网结合使用
非以太网标头,非以太网标头
非以太网接口,接口信息
snull 接口,数据包的物理传输
address resolution, MAC Address Resolution
ARP and, Using ARP with Ethernet, Using ARP with Ethernet
non-Ethernet headers, Non-Ethernet Headers
non-Ethernet interfaces, Interface Information
snull interfaces, The Physical Transport of Packets
ether_setup 函数,初始化每个设备,接口信息–设备方法
ether_setup function, Initializing Each Device, Interface Information–The Device Methods
Ethtool,Ethtool 支持
Ethtool, Ethtool Support
ETH_ALEN 宏,打开和关闭
ETH_ALEN macro, Opening and Closing
eth_header 方法,设备方法
eth_header method, The Device Methods
事件、scull 中的陷阱,热插拔事件生成
events, Pitfalls in scull, Hotplug Event Generation
hotplug,热插拔事件生成
竞态条件,scull 中的陷阱
hotplug, Hotplug Event Generation
race conditions, Pitfalls in scull
独占等待,独占等待
exclusive waits, Exclusive waits
执行、用户空间和内核空间,用户空间和内核空间,并发及其管理,延迟执行–短延迟,内核定时器,延迟,运行处理程序
execution, User Space and Kernel Space, User Space and Kernel Space, Concurrency and Its Management, Delaying Execution–Short Delays, Kernel Timers, Delays, Running the Handler
异步(中断模式),内核定时器
代码(延迟)、延迟执行–短延迟,延迟
模式、用户空间和内核空间,用户空间和内核空间
共享中断处理程序,运行处理程序
线程、并发及其管理
asynchronous (interrupt mode), Kernel Timers
of code (delaying), Delaying Execution–Short Delays, Delays
modes, User Space and Kernel Space, User Space and Kernel Space
shared interrupt handlers, Running the Handler
threads, Concurrency and Its Management
实验内核,版本编号
experimental kernels, Version Numbering
导出符号,内核符号表–内核符号表
exporting symbols, The Kernel Symbol Table–The Kernel Symbol Table
EXPORT_SYMBOL 宏,初始化和关闭,快速参考
EXPORT_SYMBOL macro, Initialization and Shutdown, Quick Reference
EXPORT_SYMBOL_GPL 宏,快速参考
EXPORT_SYMBOL_GPL macro, Quick Reference
扩展总线、外部总线
extended buses, External Buses
扩展 ISA (EISA)、EISA
Extended ISA (EISA), EISA

F

快速中断处理程序、快速和慢速处理程序
fast interrupt handlers, Fast and Slow Handlers
FASYNC 标志、文件操作,异步通知
FASYNC flag, File Operations, Asynchronous Notification
fasync 方法,文件操作
fasync method, File Operations
fasync_helper 函数,驱动程序的观点,快速参考
fasync_helper function, The Driver's Point of View, Quick Reference
fasync_struct 结构,驱动程序的观点
fasync_struct structure, The Driver's Point of View
故障、内核模块与应用程序,调试系统故障–系统挂起
faults, Kernel Modules Versus Applications, Debugging System Faults–System Hangs
故障模块(oops 消息)、Oops 消息
faulty module (oops messages), Oops Messages
faulty_read 函数,Oops 消息
faulty_read function, Oops Messages
faulty_write 函数,Oops 消息
faulty_write function, Oops Messages
fcntl 系统调用、预定义命令,异步通知
fcntl system call, The Predefined Commands, Asynchronous Notification
fcntl.h 头文件,阻塞和非阻塞操作
fcntl.h header file, Blocking and Nonblocking Operations
fc_setup函数,接口信息
fc_setup function, Interface Information
fdatasync 系统调用,刷新待处理输出
fdatasync system call, Flushing pending output
FDDI 网络、配置接口、接口信息
FDDI networks, configuring interfaces, Interface Information
fddi_setup函数,接口信息
fddi_setup function, Interface Information
光纤通道设备、初始化、接口信息
fiber channel devices, initializing, Interface Information
FIFO(先进先出)设备、scull 的设计,scull 的设计,poll 和 select
FIFO (first-in-first-out) devices, The Design of scull, The Design of scull, poll and select
poll 方法 and,poll 和 select
poll method and, poll and select
文件系统标头 (fs.h),快速参考
File System header (fs.h), Quick Reference
文件、初始化和关闭主编号和次编号文件操作文件操作文件操作文件结构文件结构、文件结构、文件结构inode 结构如何记录消息在 /proc 中实现文件,功能和受限操作,轮询和选择,设备文件上的访问控制,快速参考,快速参考,快速参考,快速参考, /proc 接口, /proc 接口,分配 IP 号,接口信息
files, Initialization and Shutdown, Major and Minor Numbers, File Operations, File OperationsFile Operations, The file Structure, The file Structure, The file Structure, The file Structure, The inode Structure, How Messages Get Logged, Implementing files in /proc, Capabilities and Restricted Operations, poll and select, Access Control on a Device File, Quick Reference, Quick Reference, Quick Reference, Quick Reference, The /proc Interface, The /proc Interface, Assigning IP Numbers, Interface Information
/etc/networks 文件,分配 IP 号
访问,设备文件的访问控制
capability.h 头文件,功能和受限操作,快速参考
设备、主要和次要编号
标志,文件结构
索引节点结构,索引节点结构
中断,/proc 接口
ioctl.h 头文件,快速参考
kmsg,如何记录消息
ksyms,初始化和关闭
模式,文件结构
net_int c,接口信息
打开,文件结构
操作,文件操作–文件操作
poll.h 头文件,poll 和 select,快速参考
/proc,在/proc中实现文件
stat,/proc 接口
结构,文件结构
结构体、文件操作
uaccess.h 头文件,快速参考
/etc/networks files, Assigning IP Numbers
access to, Access Control on a Device File
capability.h header file, Capabilities and Restricted Operations, Quick Reference
devices, Major and Minor Numbers
flags, The file Structure
inode structure, The inode Structure
interrupts, The /proc Interface
ioctl.h header file, Quick Reference
kmsg, How Messages Get Logged
ksyms, Initialization and Shutdown
modes, The file Structure
net_int c, Interface Information
open, The file Structure
operations, File Operations–File Operations
poll.h header file, poll and select, Quick Reference
/proc, Implementing files in /proc
stat, The /proc Interface
structure, The file Structure
structures, File Operations
uaccess.h header file, Quick Reference
文件系统、分割内核,分割内核,设备和模块的类,设备和模块的类,设备和模块的类,主编号和次编号–动态分配主编号,创建 /proc 文件–seq_file 接口,/proc 接口,/proc 接口和共享中断,Sysfs 操作
filesystems, Splitting the Kernel, Splitting the Kernel, Classes of Devices and Modules, Classes of Devices and Modules, Classes of Devices and Modules, Major and Minor Numbers–Dynamic Allocation of Major Numbers, Creating your /proc file–The seq_file interface, The /proc Interface, The /proc Interface and Shared Interrupts, Sysfs Operations
字符驱动程序,主要号码和次要号码-主要号码的动态分配
模块、设备和模块类别,设备和模块类别
节点、分割内核,设备和模块类
/proc,创建 /proc 文件–seq_file 接口,/proc 接口,/proc 接口和共享中断
安装中断处理程序,/proc 接口
共享中断,/proc 接口和共享中断
sysfs, Sysfs 操作
char drivers, Major and Minor Numbers–Dynamic Allocation of Major Numbers
modules, Classes of Devices and Modules, Classes of Devices and Modules
nodes, Splitting the Kernel, Classes of Devices and Modules
/proc, Creating your /proc file–The seq_file interface, The /proc Interface, The /proc Interface and Shared Interrupts
installing interrupt handlers, The /proc Interface
shared interrupts and, The /proc Interface and Shared Interrupts
sysfs, Sysfs Operations
file_operations 结构、文件操作,文件操作,文件结构,mmap 设备操作
file_operations structure, File Operations, File Operations, The file Structure, The mmap Device Operation
使用标记初始化声明,文件操作
mmap 方法和,mmap 设备操作
declaring using tagged initialization, File Operations
mmap method and, The mmap Device Operation
filp 指针、文件结构,读和写,ioctl
filp pointer, The file Structure, read and write, ioctl
在ioctl方法中,ioctl
在读/写方法中,读和写
in ioctl method, ioctl
in read/write methods, read and write
filp->f_op,文件结构
filp->f_op, The file Structure
过滤器热插拔操作,热插拔操作
filter hotplug operation, Hotplug Operations
细粒度锁定,细粒度锁定与粗粒度锁定
fine-grained locking, Fine- Versus Coarse-Grained Locking
FIOASYNC 命令,预定义命令
FIOASYNC command, The Predefined Commands
FIOCLEX 命令,预定义命令
FIOCLEX command, The Predefined Commands
FIONBIO 命令,预定义命令
FIONBIO command, The Predefined Commands
FIONCLEX 命令,预定义命令
FIONCLEX command, The Predefined Commands
FIOQSIZE 命令,预定义命令
FIOQSIZE command, The Predefined Commands
FireWire、设备和模块类别,IEEE1394 (FireWire)
FireWire, Classes of Devices and Modules, IEEE1394 (FireWire)
驱动程序、设备类和模块
drivers, Classes of Devices and Modules
固件、启动时间,处理固件–工作原理,内核固件接口,工作原理,固件
firmware, Boot Time, Dealing with FirmwareHow It Works, The Kernel Firmware Interface, How It Works, Firmware
调用,工作原理
功能、固件
接口,内核固件接口
Linux 设备模型,处理固件工作原理
PCI启动时间配置,启动时间
calling, How It Works
functions, Firmware
interfaces, The Kernel Firmware Interface
Linux device model, Dealing with FirmwareHow It Works
PCI boot-time configuration, Boot Time
flags,文件结构,轮询和选择,轮询和选择,轮询和选择,轮询和选择,轮询和选择,轮询和选择,轮询和选择,轮询和选择,轮询和选择,从设备读取数据,异步通知,标志参数,标志参数,标志参数,标志参数,标志参数,标志参数,标志参数,标志参数,标志参数,标志参数,标志参数,标志参数,标志参数,标志参数,标志参数,标志参数 , 后备缓存,后备缓存,后备缓存,后备缓存缓存Lookaside 缓存get_free_page 和朋友get_free_page 和朋友快速参考安装中断处理程序,安装中断处理程序,安装中断处理程序,安装共享处理程序,快速参考,接口特定类型,访问 I/O 和内存空间,内存映射和结构页, vm_area_struct 结构,支持可移动媒体初始化每个设备接口信息接口信息接口信息接口信息接口信息接口信息,接口信息,接口信息,接口信息 , 接口信息,接口信息,接口信息,接口信息,接口信息,接口信息,接口信息,接口信息,重要字段, struct termios , struct termios , struct termios
flags, The file Structure, poll and select, poll and select, poll and select, poll and select, poll and select, poll and select, poll and select, poll and select, poll and select, Reading data from the device, Asynchronous Notification, The Flags Argument, The Flags Argument, The Flags Argument, The Flags Argument, The Flags Argument, The Flags Argument, The Flags Argument, The Flags Argument, The Flags Argument, The Flags Argument, The Flags Argument, The Flags Argument, The Flags Argument, The Flags Argument, The Flags Argument, The Flags Argument, Lookaside Caches, Lookaside Caches, Lookaside Caches, Lookaside Caches, Lookaside Caches, get_free_page and Friends, get_free_page and Friends, Quick Reference, Installing an Interrupt Handler, Installing an Interrupt Handler, Installing an Interrupt Handler, Installing a Shared Handler, Quick Reference, Interface-Specific Types, Accessing the I/O and Memory Spaces, The Memory Map and Struct Page, The vm_area_struct structure, Supporting Removable Media, Initializing Each Device, Interface Information, Interface Information, Interface Information, Interface Information, Interface Information, Interface Information, Interface Information, Interface Information, Interface Information, Interface Information, Interface Information, Interface Information, Interface Information, Interface Information, Interface Information, Interface Information, Interface Information, The Important Fields, struct termios, struct termios, struct termios
争论,旗帜争论
FASYNC,异步通知
文件,文件结构
GFP_ATOMIC、标志参数get_free_page 和朋友
GFP_COLD,标志参数
GFP_DMA,标志参数
GFP_HIGH,标志参数
GFP_HIGHMEM,标志参数
GFP_HIGHUSER,标志参数
GFP_KERNEL、get_free_page 和朋友
GFP_NOFAIL,标志参数
GFP_NOFS,标志参数
GFP_NOIO,标志参数
GFP_NORETRY,标志参数
GFP_NOWARN,标志参数
GFP_REPEAT,标志参数
GFP_USER,标志参数
GTP_KERNEL,标志参数
IFF_ALLMULTI,接口信息
IFF_AUTOMEDIA,接口信息
IFF_BROADCAST,接口信息
IFF_DEBUG,接口信息
IFF_DYNAMIC,接口信息
IFF_LOOPBACK,接口信息
IFF_MASTER,接口信息
IFF_MULTICAST,接口信息
IFF_NOARP,初始化各个设备接口信息
IFF_NOTRAILERS,接口信息
IFF_POINTTOPOINT,接口信息
IFF_PORTSEL,接口信息
IFF_PROMISC,接口信息
IFF_RUNNING,接口信息
IFF_SLAVE,接口信息
IFF_UP,接口信息
media_change,支持可移动媒体
内存分配、标志参数旁视缓存快速参考
对于net_device结构,接口信息
O_NONBLOCK(f_flags字段),从设备读取数据
PACKET_HOST,重要字段
PG_locked,内存映射和结构页
POLLERR,轮询并选择
POLLHUP,轮询并选择
POLLIN,轮询并选择
POLLOUT,轮询并选择
POLLPRI,轮询并选择
POLLRDBAND,轮询并选择
POLLRDNORM,轮询并选择
POLLWRBAND,轮询并选择
POLLWRNORM,轮询并选择
资源 (PCI),访问 I/O 和内存空间
SA_INTERRUPT,安装中断处理程序快速参考
SA_SAMPLE_RANDOM,安装中断处理程序
SA_SHIRQ,安装中断处理程序安装共享处理程序
SLAB_CACHE_DMA,后备缓存
SLAB_CTOR_CONSTRUCTOR,后备缓存
SLAB_HWCACHE_ALIGN,后备缓存
SLAB_NO_REAP,后备缓存
TTY_DRIVER_NO_DEVFS,结构 termios
TTY_DRIVER_REAL_RAW,结构 termios
TTY_DRIVER_RESET_TERMIOS,结构 termios
VM_IO,vm_area_struct结构
墙壁、接口特定类型
argument, The Flags Argument
FASYNC, Asynchronous Notification
file, The file Structure
GFP_ATOMIC, The Flags Argument, get_free_page and Friends
GFP_COLD, The Flags Argument
GFP_DMA, The Flags Argument
GFP_HIGH, The Flags Argument
GFP_HIGHMEM, The Flags Argument
GFP_HIGHUSER, The Flags Argument
GFP_KERNEL, get_free_page and Friends
GFP_NOFAIL, The Flags Argument
GFP_NOFS, The Flags Argument
GFP_NOIO, The Flags Argument
GFP_NORETRY, The Flags Argument
GFP_NOWARN, The Flags Argument
GFP_REPEAT, The Flags Argument
GFP_USER, The Flags Argument
GTP_KERNEL, The Flags Argument
IFF_ALLMULTI, Interface Information
IFF_AUTOMEDIA, Interface Information
IFF_BROADCAST, Interface Information
IFF_DEBUG, Interface Information
IFF_DYNAMIC, Interface Information
IFF_LOOPBACK, Interface Information
IFF_MASTER, Interface Information
IFF_MULTICAST, Interface Information
IFF_NOARP, Initializing Each Device, Interface Information
IFF_NOTRAILERS, Interface Information
IFF_POINTTOPOINT, Interface Information
IFF_PORTSEL, Interface Information
IFF_PROMISC, Interface Information
IFF_RUNNING, Interface Information
IFF_SLAVE, Interface Information
IFF_UP, Interface Information
media_change, Supporting Removable Media
memory allocation, The Flags Argument, Lookaside Caches, Quick Reference
for net_device structure, Interface Information
O_NONBLOCK (f_flags field), Reading data from the device
PACKET_HOST, The Important Fields
PG_locked, The Memory Map and Struct Page
POLLERR, poll and select
POLLHUP, poll and select
POLLIN, poll and select
POLLOUT, poll and select
POLLPRI, poll and select
POLLRDBAND, poll and select
POLLRDNORM, poll and select
POLLWRBAND, poll and select
POLLWRNORM, poll and select
resource (PCI), Accessing the I/O and Memory Spaces
SA_INTERRUPT, Installing an Interrupt Handler, Quick Reference
SA_SAMPLE_RANDOM, Installing an Interrupt Handler
SA_SHIRQ, Installing an Interrupt Handler, Installing a Shared Handler
SLAB_CACHE_DMA, Lookaside Caches
SLAB_CTOR_CONSTRUCTOR, Lookaside Caches
SLAB_HWCACHE_ALIGN, Lookaside Caches
SLAB_NO_REAP, Lookaside Caches
TTY_DRIVER_NO_DEVFS, struct termios
TTY_DRIVER_REAL_RAW, struct termios
TTY_DRIVER_RESET_TERMIOS, struct termios
VM_IO, The vm_area_struct structure
Wall, Interface-Specific Types
翻转(tty 驱动程序),没有读取功能?
flips (tty drivers), No read Function?
数据流(tty 驱动程序)、数据流
flow of data (tty drivers), Flow of Data
lush方法、文件操作release方法
flush method, File Operations, The release Method
close系统调用和release方法
close system call and, The release Method
flush operation, File Operations
flushing pending output, Flushing pending output, Flushing pending output
fonts (incorrect on console), Device Control Without ioctl
fops pointers, File Operations
forms (BCD), What Devices Does the Driver Support?
fragmentation, Do-it-yourself allocation, Do-it-yourself allocation
free command, Playing with the New Devices
freeing, Allocating and Freeing Device Numbers, The Linux Semaphore Implementation, DMA pools, The gendisk structure, Functions Acting on Socket Buffers
buffers, Functions Acting on Socket Buffers
device numbers, Allocating and Freeing Device Numbers
disks, The gendisk structure
DMA pools, DMA pools
semaphores, The Linux Semaphore Implementation
free_dma function, Registering DMA usage
free_irq function, Installing a Shared Handler
free_netdev function, Module Unloading
free_pages function, get_free_page and Friends
fs.h header file, Quick Reference, Blocking and Nonblocking Operations, The Driver's Point of View, Quick Reference
asynchronous notification and, The Driver's Point of View
blocking/nonblocking operations, Blocking and Nonblocking Operations
fsync method, File Operations, Flushing pending output
full class interfaces, The Full Class Interface
functions, The Hello World Module, The Hello World Module, Kernel Modules Versus Applications, A Few Other Details, Initialization and Shutdown, Module-Loading Races, Initialization and Shutdown, The Cleanup Function, The open Method, The open Method, The release Method, scull's Memory Usage, printk, Printing Device Numbers, printk, How Messages Get Logged, How Messages Get Logged, Turning the Messages On and Off, Implementing files in /proc, Creating your /proc file, Creating your /proc file, The seq_file interface, The seq_file interface, The seq_file interface, The seq_file interface, The seq_file interface, The seq_file interface, Oops Messages, Oops Messages, System Hangs, System Hangs, System Hangs, Semaphores and Mutexes, The Linux Semaphore Implementation, The Linux Semaphore Implementation, The Spinlock Functions, Ambiguous Rules, Using the ioctl Argument, Using the ioctl Argument, Using the ioctl Argument, Capabilities and Restricted Operations, Simple Sleeping, Manual sleeps, Exclusive waits, Ancient history: sleep_on, poll and select, The Driver's Point of View, The Driver's Point of View, Cloning the Device on open, Quick Reference, Quick Reference, Quick Reference, Quick Reference, Quick Reference, Quick Reference, Quick Reference, Quick Reference, Quick Reference, Quick Reference, Quick Reference, Quick Reference, Processor-Specific Registers, Processor-Specific Registers, Knowing the Current Time, Knowing the Current Time, Yielding the processor, Timeouts, Timeouts, Kernel Timers, Kernel Timers, The Timer API, The Timer API, The Implementation of Kernel Timers, Tasklets, Tasklets, Tasklets, Tasklets, Tasklets, Tasklets, Workqueues, The Real Story of kmalloc, The Size Argument, The Flags Argument, Lookaside Caches, Lookaside Caches, Lookaside Caches, Lookaside Caches, get_free_page and Friends, get_free_page and Friends, get_free_page and Friends, get_free_page and Friends, get_free_page and Friends, get_free_page and Friends, vmalloc and Friends, A scull 
Using Virtual Addresses: scullv, vmalloc and Friends, vmalloc and Friends, vmalloc and Friends, vmalloc and Friends, Quick Reference, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory, Manipulating I/O ports, Manipulating I/O ports, Manipulating I/O ports, Manipulating I/O ports, Manipulating I/O ports, Manipulating I/O ports, String Operations, String Operations, String Operations, String Operations, String Operations, String Operations, Pausing I/O, Pausing I/O, Using I/O Memory, I/O Memory Allocation and Mapping, I/O Memory Allocation and Mapping, isa_readb and Friends, Quick Reference, Installing an Interrupt Handler, Installing an Interrupt Handler, Installing an Interrupt Handler, Installing an Interrupt Handler, Installing an Interrupt Handler, Kernel-assisted probing, Kernel-assisted probing, The internals of interrupt handling on the x86, The internals of interrupt handling on the x86, Tasklets, Installing a Shared Handler, Installing a Shared Handler, Installing a Shared Handler, Data Alignment, Data Alignment, Linked Lists, Linked Lists, Linked Lists, Linked Lists, Linked Lists, Linked Lists, Enabling the PCI Device, Accessing the I/O and Memory Spaces, Accessing the I/O and Memory Spaces, Accessing the I/O and Memory Spaces, Accessing the I/O and Memory Spaces, Creating and Destroying Urbs, Interrupt urbs, Bulk urbs, Control urbs, Submitting Urbs, Completing Urbs: The Completion Callback Handler, Completing Urbs: The Completion Callback Handler, Canceling Urbs, Canceling Urbs, Registering a USB Driver, Registering a USB Driver, Registering a USB Driver, Registering a USB Driver, 
Registering a USB Driver, Registering a USB Driver, probe and disconnect in Detail, probe and disconnect in Detail, probe and disconnect in Detail, probe and disconnect in Detail, usb_bulk_msg, usb_control_msg, Other USB Data Functions, Other USB Data Functions, Release functions and kobject types, Bus registration, Bus methods, Iterating over devices and drivers, Driver structure embedding, Remove a Device, Add a Driver, Remove a Driver, udev, udev, udev, udev, The Kernel Firmware Interface, Sysfs Operations, Buses, Devices, and Drivers, Buses, Devices, and Drivers, Buses, Devices, and Drivers, Classes, Firmware, The Memory Map and Struct Page, The Memory Map and Struct Page, The Memory Map and Struct Page, Using remap_pfn_range, Mapping Memory with nopage, Performing Direct I/O, Setting up coherent DMA mappings, Scatter/gather mappings, Registering DMA usage, Registering DMA usage, Talking to the DMA controller, Talking to the DMA controller, Talking to the DMA controller, Talking to the DMA controller, Talking to the DMA controller, Talking to the DMA controller, Talking to the DMA controller, Talking to the DMA controller, Talking to the DMA controller, Direct Memory Access, Direct Memory Access, Block Driver Registration, Initialization in sbull, A Note on Sector Sizes, Request Processing, Doing without a request queue, A Simple request Method, Queue creation and deletion, Queueing functions, Queueing functions, Queueing functions, Queue control functions, Queue control functions, Queue control functions, Queue control functions, Queue control functions, Queue control functions, Queue control functions, Queue control functions, Queue control functions, Queue control functions, Command Pre-Preparation, Quick Reference, Quick Reference, Initializing Each Device, Initializing Each Device, Initializing Each Device, Module Unloading, Module Unloading, Interface Information, The Device Methods, Interface Information, Interface Information, Interface Information, Interface 
Information, Utility Fields, Opening and Closing, Opening and Closing, Controlling Transmission Concurrency, Controlling Transmission Concurrency, The Interrupt Handler, Changes in Link State, Changes in Link State, Changes in Link State, Functions Acting on Socket Buffers, Functions Acting on Socket Buffers, Functions Acting on Socket Buffers, Functions Acting on Socket Buffers, Functions Acting on Socket Buffers, Functions Acting on Socket Buffers, Functions Acting on Socket Buffers, Functions Acting on Socket Buffers, Functions Acting on Socket Buffers, Functions Acting on Socket Buffers, Functions Acting on Socket Buffers, Functions Acting on Socket Buffers, Functions Acting on Socket Buffers, Functions Acting on Socket Buffers, A Typical Implementation, Quick Reference, Quick Reference, A Small TTY Driver, A Small TTY Driver, tty_driver Function Pointers, No read Function?, open and close, open and close, open and close, open and close, open and close, open and close, Flow of Data, Other Buffering Functions, No read Function?, set_termios, set_termios, tiocmget and tiocmset, tiocmget and tiocmset, ioctls, Quick Reference
access_ok, Using the ioctl Argument
alloc_netdev, Initializing Each Device
alloc_skb, Functions Acting on Socket Buffers
alloc_tty_driver, A Small TTY Driver
blkdev_dequeue_request, Queueing functions
blk_cleanup_queue, Queue creation and deletion
blk_queue_hardsect_size, A Note on Sector Sizes
blk_queue_segment_boundary, Queue control functions
block drivers, Quick Reference, Quick Reference
buses, Buses, Devices, and Drivers
bus_add_driver, Add a Driver
bus_for_each_dev, Iterating over devices and drivers
bus_register, Bus registration
calling from modules/applications, Kernel Modules Versus Applications
capable, Capabilities and Restricted Operations, Quick Reference
chars_in_buffer, Other Buffering Functions
claim_dma_lock, Talking to the DMA controller
classes, Classes
class_simple_create, udev
class_simple_device_add, udev
class_simple_device_remove, udev
cleanup, The Cleanup Function
clear_dma_ff, Talking to the DMA controller
close (tty drivers), open and close, open and close
complete (urbs), Completing Urbs: The Completion Callback Handler
const char *dev_name, Installing an Interrupt Handler
const char *name, Registering a USB Driver
const struct usb_device_id *id_table, Registering a USB Driver
constructor (kmem_cache_create), Lookaside Caches
create_proc_read_entry, Creating your /proc file
del_timer_sync, The Timer API
devices, Buses, Devices, and Drivers
dev_alloc_skb, Functions Acting on Socket Buffers
dev_kfree_skb, The Interrupt Handler, Functions Acting on Socket Buffers
disable_dma, Talking to the DMA controller
disable_irq, Installing a Shared Handler
disconnect (USB), Registering a USB Driver, probe and disconnect in Detail
dma_free_coherent, Setting up coherent DMA mappings
double underscore (__), A Few Other Details
down, The Linux Semaphore Implementation
do_close, open and close
do_gettimeofday, Knowing the Current Time
do_IRQ, The internals of interrupt handling on the x86
drivers, Buses, Devices, and Drivers
driver_unregister, Remove a Driver
elv_next_request, A Simple request Method, Queueing functions, Command Pre-Preparation
enable_dma, Talking to the DMA controller
enable_irq, Installing a Shared Handler
ether_setup, Initializing Each Device, Interface Information, The Device Methods
fasync_helper, The Driver's Point of View, Quick Reference
faulty_read, Oops Messages
faulty_write, Oops Messages
fc_setup, Interface Information
fddi_setup, Interface Information
firmware, Firmware
free_dma, Registering DMA usage
free_irq, Installing a Shared Handler
free_netdev, Module Unloading
free_pages, get_free_page and Friends
get_cycles, Processor-Specific Registers
get_dma_residue, Talking to the DMA controller
get_fast_time, Knowing the Current Time
get_free_page, get_free_page and Friends
get_free_pages, The Flags Argument, get_free_page and Friends, vmalloc and Friends
get_page, Mapping Memory with nopage
get_unaligned, Data Alignment
get_user, Using the ioctl Argument, Quick Reference
get_user_pages, Performing Direct I/O
get_zeroed_page, get_free_page and Friends
handle_IRQ_event, The internals of interrupt handling on the x86
hello world module, The Hello World Module
hippi_setup, Interface Information
inb, Manipulating I/O ports
inb_p, Pausing I/O
initialization, Initialization and Shutdown, Module-Loading Races
inl, Manipulating I/O ports
insb, String Operations
inserting schedules, System Hangs
insl, String Operations
insw, String Operations
int (USB), Registering a USB Driver
int pci_enable_device, Enabling the PCI Device
int seq_escape, The seq_file interface
int seq_path, The seq_file interface
int seq_printf, The seq_file interface
int seq_putc, The seq_file interface
int seq_puts, The seq_file interface
inw, Manipulating I/O ports
in_atomic, Kernel Timers
in_interrupt, Kernel Timers
ioctl (tty drivers), ioctls
ioremap, vmalloc and Friends, Using I/O Memory, Quick Reference
ioremap_nocache, I/O Memory Allocation and Mapping
iounmap, vmalloc and Friends, I/O Memory Allocation and Mapping
irqreturn_t, Installing an Interrupt Handler
isa_readb, isa_readb and Friends
kfree_skb, Functions Acting on Socket Buffers
kill_fasync, The Driver's Point of View, Quick Reference
kmalloc, scull's Memory Usage, The Real Story of kmalloc, The Size Argument, get_free_page and Friends
allocation engine, The Real Story of kmalloc, The Size Argument
performance degradation issues, get_free_page and Friends
kmap, The Memory Map and Struct Page
kmap_skb_frag, Functions Acting on Socket Buffers
kmem_cache_alloc, Lookaside Caches
kmem_cache_create, Lookaside Caches
kmem_cache_t type, Lookaside Caches
list_add, Linked Lists
list_add_tail, Linked Lists
list_del, Linked Lists
list_empty, Linked Lists
list_move, Linked Lists
list_splice, Linked Lists
locking, Ambiguous Rules
match (buses), Bus methods
module_init, Initialization and Shutdown
mod_timer, The Timer API, The Implementation of Kernel Timers
netif_carrier_off, Changes in Link State
netif_carrier_ok, Changes in Link State
netif_carrier_on, Changes in Link State
netif_start_queue, Opening and Closing
netif_stop_queue, Opening and Closing, Controlling Transmission Concurrency
netif_wake_queue, Controlling Transmission Concurrency
network drivers, Quick Reference, Quick Reference
open (tty drivers), open and close, open and close
outb, Manipulating I/O ports
outb_p, Pausing I/O
outl, Manipulating I/O ports
outsb, String Operations
outsl, String Operations
outsw, String Operations
outw, Manipulating I/O ports
page-oriented allocation, get_free_page and Friends, Quick Reference
pci_map_sg, Scatter/gather mappings
pci_remove_bus_device, Remove a Device
pci_resource_, Accessing the I/O and Memory Spaces
pfn_to_page, The Memory Map and Struct Page
poll_wait, poll and select, Quick Reference
printk, The Hello World Module, printk, Printing Device Numbers, How Messages Get Logged, How Messages Get Logged, Turning the Messages On and Off, The seq_file interface
circular buffers for, How Messages Get Logged
logging messages from, How Messages Get Logged
seq_file interface (avoiding in), The seq_file interface
turning debug messages on/off, Turning the Messages On and Off
probe (USB), probe and disconnect in Detail
probe_irq_off, Kernel-assisted probing
probe_irq_on, Kernel-assisted probing
put_unaligned, Data Alignment
put_user, Using the ioctl Argument, Quick Reference
queues, Queueing functions
rdtscl, Processor-Specific Registers
read (tty drivers), No read Function?
read_proc, Implementing files in /proc
register_blkdev, Block Driver Registration
register_chrdev, udev
register_netdev, Initializing Each Device
release_dma_lock, Talking to the DMA controller
release (kobjects), Release functions and kobject types
remap_pfn_range, Using remap_pfn_range
remove_proc_entry, Creating your /proc file
request (block drivers), Request Processing, Doing without a request queue
request_dma, Registering DMA usage
request_firmware, The Kernel Firmware Interface
SAK, System Hangs
sbull_request, Initialization in sbull
schedule, System Hangs, Quick Reference, Yielding the processor
execution of code (delaying), Yielding the processor
preventing endless loops with, System Hangs
schedule_timeout, Timeouts
scull, The open Method, The open Method, The release Method
open method, The open Method, The open Method
release method, The release Method
scull_cleanup, Cloning the Device on open
scull_getwritespace, Manual sleeps
semaphores, Semaphores and Mutexes (see semaphores)
set_dma_addr, Talking to the DMA controller
set_dma_count, Talking to the DMA controller
set_dma_mode, Talking to the DMA controller
set_mb, I/O Registers and Conventional Memory
set_multicast_list, A Typical Implementation
set_rmb, I/O Registers and Conventional Memory
set_termios, set_termios
set_wmb, I/O Registers and Conventional Memory
sg_dma_address, Direct Memory Access
sg_dma_len, Direct Memory Access
show, Driver structure embedding
skb_headlen, Functions Acting on Socket Buffers
skb_headroom, Functions Acting on Socket Buffers
skb_is_nonlinear, Functions Acting on Socket Buffers
skb_pull, Functions Acting on Socket Buffers
skb_push, Functions Acting on Socket Buffers
skb_put, Functions Acting on Socket Buffers
skb_reserve, Functions Acting on Socket Buffers
skb_tailroom, Functions Acting on Socket Buffers
sleep_on, Ancient history: sleep_on
acting on socket buffers, Functions Acting on Socket Buffers
spinlocks, The Spinlock Functions
struct module *owner, Registering a USB Driver
sysfs filesystem, Sysfs Operations
sys_syslog, printk
tasklet_schedule, Tasklets
tiny_close, open and close
tiocmget, tiocmget and tiocmset
tiocmset, tiocmget and tiocmset
tr_configure, Interface Information
tty drivers, Quick Reference
tty_driver (pointers), tty_driver Function Pointers, No read Function?
tty_get_baud_rate, set_termios
tty_register_driver, A Small TTY Driver
unregister_netdev, Module Unloading
unsigned int irq, Installing an Interrupt Handler
unsigned long flags, Installing an Interrupt Handler
unsigned long pci_resource_end, Accessing the I/O and Memory Spaces
unsigned long pci_resource_start, Accessing the I/O and Memory Spaces
unsigned pci_resource_flags, Accessing the I/O and Memory Spaces
up, The Linux Semaphore Implementation
urbs_completion, Completing Urbs: The Completion Callback Handler
usb_alloc_urb, Creating and Destroying Urbs
usb_bulk_msg, usb_bulk_msg
usb_control_msg, usb_control_msg
usb_fill_bulk_urb, Bulk urbs
usb_fill_control_urb, Control urbs
usb_fill_int_urb, Interrupt urbs
usb_get_descriptor, Other USB Data Functions
usb_kill_urb, Canceling Urbs
usb_register_dev, probe and disconnect in Detail
usb_set_intfdata, probe and disconnect in Detail
usb_string, Other USB Data Functions
usb_submit_urb, Submitting Urbs
usb_unlink_urb, Canceling Urbs
vfree, vmalloc and Friends
virt_to_page, The Memory Map and Struct Page
vmalloc allocation, vmalloc and Friends, A scull Using Virtual Addresses: scullv
void, Registering a USB Driver
void barrier, I/O Registers and Conventional Memory
void blk_queue_bounce_limit, Queue control functions
void blk_queue_dma_alignment, Queue control functions
void blk_queue_hardsect_size, Queue control functions
void blk_queue_max_hw_segments, Queue control functions
void blk_queue_max_phys_segments, Queue control functions
void blk_queue_max_sectors, Queue control functions
void blk_queue_max_segment_size, Queue control functions
void blk_start_queue, Queue control functions
void blk_stop_queue, Queue control functions
void mb, I/O Registers and Conventional Memory
void read_barrier_depends, I/O Registers and Conventional Memory
void rmb, I/O Registers and Conventional Memory
void smp_mb, I/O Registers and Conventional Memory
void smp_rmb, I/O Registers and Conventional Memory
void smp_wmb, I/O Registers and Conventional Memory
void tasklet_disable, Tasklets
void tasklet_disable_nosync, Tasklets
void tasklet_enable, Tasklets
void tasklet_hi_schedule, Tasklets
void tasklet_kill, Tasklets
void tasklet_schedule, Tasklets
void wmb, I/O Registers and Conventional Memory
void *dev_id, Installing an Interrupt Handler
wait_event_interruptible_timeout, Timeouts
wake-up, Simple Sleeping, Quick Reference
wake_up, Exclusive waits, Quick Reference
wake_up_interruptible, Quick Reference
wake_up_interruptible_sync, Quick Reference
wake_up_sync, Quick Reference
workqueues, Workqueues
write (tty drivers), Flow of Data
xmit_lock, Utility Fields
f_dentry pointer, The file Structure
f_flags field (file structure), The file Structure, The Predefined Commands, Blocking and Nonblocking Operations
O_NONBLOCK flag, The Predefined Commands, Blocking and Nonblocking Operations
f_mode field (file structure), The file Structure
f_op pointer, The file Structure
f_pos field (file structure), The file Structure, Implementing files in /proc
read_proc function and, Implementing files in /proc
F_SETFL command, The Predefined Commands, Asynchronous Notification
fcntl system call and, Asynchronous Notification
F_SETFL fcntl command, Asynchronous Notification
F_SETOWN command, Asynchronous Notification, Asynchronous Notification
fcntl system call and, Asynchronous Notification

G

gcc compiler, Processor-Specific Registers
gdb commands, Using gdb, The kgdb Patches
gendisk structure, The gendisk structure
general distribution, writing drivers for, Platform Dependency
General Public License (GPL), License Terms
generic DMA layers, The Generic DMA Layer
generic I/O address spaces, Accessing the I/O and Memory Spaces
geographical addressing, PCI Addressing
get_cycles function, Processor-Specific Registers
get_dma_residue function, Talking to the DMA controller
get_fast_time function, Knowing the Current Time
get_free_page function, get_free_page and Friends
get_free_pages function, The Flags Argument, get_free_page and Friends, vmalloc and Friends
get_kernel_syms system call, Loading and Unloading Modules
get_page function, Mapping Memory with nopage
get_stats method, The Device Methods, Statistical Information
get_unaligned function, Data Alignment
get_user function, Using the ioctl Argument, Quick Reference
get_user_pages function, Performing Direct I/O
get_zeroed_page function, get_free_page and Friends
gfp.h header file, The Flags Argument
GFP_ATOMIC flag, The Flags Argument, get_free_page and Friends, get_free_page and Friends
page-oriented allocation functions, get_free_page and Friends
preparing for allocation failure, get_free_page and Friends
GFP_COLD 标志,标志参数
GFP_COLD flag, The Flags Argument
GFP_DMA 标志,标志参数
GFP_DMA flag, The Flags Argument
GFP_HIGH 标志,标志参数
GFP_HIGH flag, The Flags Argument
GFP_HIGHMEM 标志,标志参数
GFP_HIGHMEM flag, The Flags Argument
GFP_HIGHUSER 标志,标志参数
GFP_HIGHUSER flag, The Flags Argument
GFP_KERNEL 标志、标志参数、get_free_page 和朋友
GFP_KERNEL flag, The Flags Argument, get_free_page and Friends
GFP_NOFAIL 标志,标志参数
GFP_NOFAIL flag, The Flags Argument
GFP_NOFS 标志,标志参数
GFP_NOFS flag, The Flags Argument
GFP_NOIO 标志,标志参数
GFP_NOIO flag, The Flags Argument
GFP_NORETRY 标志,标志参数
GFP_NORETRY flag, The Flags Argument
GFP_NOWARN 标志,标志参数
GFP_NOWARN flag, The Flags Argument
GFP_REPEAT 标志,标志参数
GFP_REPEAT flag, The Flags Argument
GFP_USER 标志,标志参数
GFP_USER flag, The Flags Argument
全局信息(net_device 结构体),全局信息
global information (net_device structure), Global Information
全局内存区域,scull 的设计
global memory areas, The Design of scull
全局消息(启用/禁用),打开和关闭消息
global messages (enabling/disabling), Turning the Messages On and Off
GNU 通用公共许可证 (GPL),许可条款
GNU General Public License (GPL), License Terms
goto 语句,初始化期间的错误处理,初始化期间的错误处理
goto statement, Error Handling During Initialization, Error Handling During Initialization
GPL(GNU 通用公共许可证),许可条款
GPL (GNU General Public License), License Terms
组、设备、主号码动态分配
group, device, Dynamic Allocation of Major Numbers

H

H

黑客内核选项,内核中的调试支持–内核中的调试支持
hacking kernels options, Debugging Support in the Kernel–Debugging Support in the Kernel
handle_IRQ_event 函数,x86 上中断处理的内部原理
handle_IRQ_event function, The internals of interrupt handling on the x86
挂起(系统),系统挂起–系统挂起
hangs (system), System Hangs–System Hangs
硬件、ioctl–没有 ioctl 的设备控制、I/O 端口和 I/O 内存–isa_readb 和朋友、快速参考、硬件抽象、硬件资源、直接内存访问、处理困难硬件、支持可移动介质、硬件信息、接口信息、设备方法、设备方法、打开和关闭、作用于套接字缓冲区的函数、覆盖 ARP、非以太网标头
hardware, ioctl–Device Control Without ioctl, I/O Ports and I/O Memory–isa_readb and Friends, Quick Reference, Hardware Abstractions, Hardware Resources, Direct Memory Access, Dealing with difficult hardware, Supporting Removable Media, Hardware Information, Interface Information, The Device Methods, The Device Methods, Opening and Closing, Functions Acting on Socket Buffers, Overriding ARP, Non-Ethernet Headers
地址、接口信息、设备方法、打开和关闭
分配,打开和关闭
修改,设备方法
DMA,直接内存访问,处理困难的硬件
标头、设备方法、作用于套接字缓冲区的函数、覆盖 ARP、非以太网标头
在传输数据包之前添加,作用于套接字缓冲区的函数
构建,设备方法
封装信息,非以太网标头
ioctl 方法,ioctl–不使用 ioctl 的设备控制
ISA、硬件资源
管理,I/O 端口和 I/O 内存–isa_readb 和朋友,快速参考
net_device结构体,硬件信息
PCI(抽象)、硬件抽象
可移动媒体(支持),支持可移动媒体
addresses, Interface Information, The Device Methods, Opening and Closing
assignment of, Opening and Closing
modification of, The Device Methods
DMA, Direct Memory Access, Dealing with difficult hardware
headers, The Device Methods, Functions Acting on Socket Buffers, Overriding ARP, Non-Ethernet Headers
adding before transmitting packets, Functions Acting on Socket Buffers
building, The Device Methods
encapsulating information, Non-Ethernet Headers
ioctl method, ioctlDevice Control Without ioctl
ISA, Hardware Resources
management, I/O Ports and I/O Memoryisa_readb and Friends, Quick Reference
net_device structure, Hardware Information
PCI (abstractions), Hardware Abstractions
removable media (supporting), Supporting Removable Media
hard_header 方法、设备方法、在以太网中使用 ARP
hard_header method, The Device Methods, Using ARP with Ethernet
hard_start_transmit方法,数据包传输
hard_start_transmit method, Packet Transmission
hard_start_xmit 方法,设备方法,数据包传输
hard_start_xmit method, The Device Methods, Packet Transmission
标头、内核模块与应用程序、内核符号表、覆盖 ARP、非以太网标头、非以太网标头
headers, Kernel Modules Versus Applications, The Kernel Symbol Table, Overriding ARP, Non-Ethernet Headers, Non-Ethernet Headers
文件、内核模块与应用程序、内核符号表
硬件、覆盖 ARP
非以太网、非以太网标头、非以太网标头
files, Kernel Modules Versus Applications, The Kernel Symbol Table
hardware, Overriding ARP
non-Ethernet, Non-Ethernet Headers, Non-Ethernet Headers
header_cache 方法,设备方法
header_cache method, The Device Methods
header_cache_update 方法,设备方法
header_cache_update method, The Device Methods
hello world 模块,Hello World 模块–Hello World 模块
hello world module, The Hello World Module–The Hello World Module
层次结构、创建 /proc 文件、创建 /proc 文件、Kobject 层次结构、Kset 和子系统、kset 上的操作
hierarchies, Creating your /proc file, Creating your /proc file, Kobject Hierarchies, Ksets, and Subsystems, Operations on ksets
(另请参阅文件系统)
kobject、Kobject 层次结构、Kset 和子系统
ksets,ksets 上的操作
/proc 文件连接,创建 /proc 文件
(see also filesystems)
kobjects, Kobject Hierarchies, Ksets, and Subsystems
ksets, Operations on ksets
/proc file connections, Creating your /proc file
高内存、内存区域、高内存和低内存、高内存和低内存
high memory, Memory zones, High and Low Memory, High and Low Memory
HIPPI 驱动程序,准备字段,接口信息
HIPPI drivers, preparing fields for, Interface Information
hippi_setup函数,接口信息
hippi_setup function, Interface Information
主机名(snull 接口)、分配 IP 号
hostnames (snull interfaces), Assigning IP Numbers
hotplugs、Linux 设备模型、热插拔事件生成、Hotplug–udev、Linux hotplug 脚本
hotplugs, The Linux Device Model, Hotplug Event Generation, Hotplug–udev, Linux hotplug scripts
设备,Linux 设备模型
事件,热插拔事件生成
Linux 设备模型,Hotplug–udev
脚本、Linux 热插拔脚本
devices, The Linux Device Model
events, Hotplug Event Generation
Linux device model, Hotplug–udev
scripts, Linux hotplug scripts
集线器 (USB)、USB 和 Sysfs
hubs (USB), USB and Sysfs
挂起系统、系统挂起
hung system, System Hangs
超线程处理器,避免死锁,自旋锁
hyperthreaded processors, avoiding deadlocks, Spinlocks
HZ(时间频率)符号、测量时间间隔、时间间隔
HZ (time frequency) symbol, Measuring Time Lapses, Time Intervals

I

I/O、阻塞 I/O–测试 Scullpipe 驱动程序、刷新挂起输出、刷新挂起输出、I/O 端口和 I/O 内存–isa_readb 和朋友、I/O 寄存器和常规内存、字符串操作、暂停 I/O、暂停 I/O、使用 I/O 内存、I/O 内存分配和映射、快速参考、中断驱动 I/O–写缓冲示例、PCI 寻址、访问 I/O 和内存空间、访问 I/O 和内存空间、重新映射特定 I/O 区域、执行直接 I/O–异步 I/O 示例、异步 I/O–异步 I/O 示例、实现直接 I/O、请求队列、分散/聚集 I/O
I/O, Blocking I/O–Testing the Scullpipe Driver, Flushing pending output, Flushing pending output, I/O Ports and I/O Memory–isa_readb and Friends, I/O Registers and Conventional Memory, String Operations, Pausing I/O, Pausing I/O, Using I/O Memory, I/O Memory Allocation and Mapping, Quick Reference, Interrupt-Driven I/O–A Write-Buffering Example, PCI Addressing, Accessing the I/O and Memory Spaces, Accessing the I/O and Memory Spaces, Remapping Specific I/O Regions, Performing Direct I/O–An asynchronous I/O example, Asynchronous I/O–An asynchronous I/O example, Implementing Direct I/O, Request Queues, Scatter/Gather I/O
异步,异步 I/O–异步 I/O 示例
阻塞,阻塞 I/O–测试 Scullpipe 驱动程序
直接,执行直接 I/O–异步 I/O 示例,实现直接 I/O
刷新挂起,刷新挂起输出
通用地址空间,访问 I/O 和内存空间
硬件管理,I/O 端口和 I/O 内存–isa_readb 和朋友
中断处理程序,中断驱动 I/O–写缓冲示例
映射,I/O 内存分配和映射,快速参考
内存(访问),使用 I/O 内存
暂停、暂停 I/O暂停 I/O
PCI、PCI 寻址访问 I/O 和内存空间
区域,重新映射特定 I/O 区域
寄存器、I/O 寄存器和传统存储器
分散/聚集、分散/聚集 I/O
调度程序、请求队列
字符串操作,字符串操作
asynchronous, Asynchronous I/OAn asynchronous I/O example
blocking, Blocking I/OTesting the Scullpipe Driver
direct, Performing Direct I/OAn asynchronous I/O example, Implementing Direct I/O
flushing pending, Flushing pending output
generic address spaces, Accessing the I/O and Memory Spaces
hardware management, I/O Ports and I/O Memoryisa_readb and Friends
interrupt handlers, Interrupt-Driven I/OA Write-Buffering Example
mapping, I/O Memory Allocation and Mapping, Quick Reference
memory (access), Using I/O Memory
pausing, Pausing I/O, Pausing I/O
PCI, PCI Addressing, Accessing the I/O and Memory Spaces
regions, Remapping Specific I/O Regions
registers, I/O Registers and Conventional Memory
scatter/gather, Scatter/Gather I/O
schedulers, Request Queues
string operations, String Operations
I/O 寄存器与 RAM、I/O 寄存器和传统存储器
I/O registers versus RAM, I/O Registers and Conventional Memory
I2O 驱动程序、设备类和模块
I2O drivers, Classes of Devices and Modules
IA-64 架构、平台依赖性、/proc 接口
IA-64 architecture, Platform Dependencies, The /proc Interface
移植和平台依赖性
/proc/interrupts 文件,/proc 接口的快照
porting and, Platform Dependencies
/proc/interrupts file, snapshot of, The /proc Interface
IEEE1394总线(火线)、IEEE1394(火线)
IEEE1394 bus (Firewire), IEEE1394 (FireWire)
if.h 头文件,接口信息,自定义 ioctl 命令
if.h header file, Interface Information, Custom ioctl Commands
ifconfig 命令、硬件信息、打开和关闭–打开和关闭
ifconfig command, Hardware Information, Opening and Closing–Opening and Closing
net_device 结构和硬件信息
打开网络驱动程序,打开和关闭–打开和关闭
net_device structure and, Hardware Information
opening network drivers, Opening and Closing–Opening and Closing
IFF_ 符号、接口信息、内核对多播的支持
IFF_ symbols, Interface Information, Kernel Support for Multicasting
IFF_ALLMULTI 标志,接口信息
IFF_ALLMULTI flag, Interface Information
IFF_AUTOMEDIA 标志,接口信息
IFF_AUTOMEDIA flag, Interface Information
IFF_BROADCAST 标志,接口信息
IFF_BROADCAST flag, Interface Information
IFF_DEBUG 标志,接口信息
IFF_DEBUG flag, Interface Information
IFF_DYNAMIC 标志,接口信息
IFF_DYNAMIC flag, Interface Information
IFF_LOOPBACK 标志,接口信息
IFF_LOOPBACK flag, Interface Information
IFF_MASTER 标志,接口信息
IFF_MASTER flag, Interface Information
IFF_MULTICAST 标志,接口信息
IFF_MULTICAST flag, Interface Information
IFF_NOARP 标志,初始化每个设备,接口信息
IFF_NOARP flag, Initializing Each Device, Interface Information
IFF_NOTRAILERS 标志,接口信息
IFF_NOTRAILERS flag, Interface Information
IFF_POINTOPOINT 标志,接口信息
IFF_POINTOPOINT flag, Interface Information
IFF_PORTSEL 标志,接口信息
IFF_PORTSEL flag, Interface Information
IFF_PROMISC 标志,接口信息
IFF_PROMISC flag, Interface Information
IFF_RUNNING 标志,接口信息
IFF_RUNNING flag, Interface Information
IFF_SLAVE 标志,接口信息
IFF_SLAVE flag, Interface Information
IFF_UP 标志,接口信息
IFF_UP flag, Interface Information
ifreq 结构,自定义 ioctl 命令
ifreq structure, Custom ioctl Commands
实现、设备驱动程序的角色、设备和模块的类、打开和关闭消息、在 /proc 中实现文件、Linux 信号量实现–读取器/写入器信号量、ioctl 命令的实现、寻找设备、忙碌等待、内核定时器的实现、实现处理程序–禁用所有中断、回顾:ISA–即插即用规范、Linux 中的内存管理–高内存和低内存、异步 I/O、实现 mmap、实现直接 I/O、支持可移动媒体、典型实现
implementation, The Role of the Device Driver, Classes of Devices and Modules, Turning the Messages On and Off, Implementing files in /proc, The Linux Semaphore Implementation–Reader/Writer Semaphores, The Implementation of the ioctl Commands, Seeking a Device, Busy waiting, The Implementation of Kernel Timers, Implementing a Handler–Disabling all interrupts, A Look Back: ISA–The Plug-and-Play Specification, Memory Management in Linux–High and Low Memory, Asynchronous I/O, Implementing mmap, Implementing Direct I/O, Supporting Removable Media, A Typical Implementation
异步 I/O,异步 I/O
忙等待,忙等待
类的数量、设备和模块的类
调试级别,打开和关闭消息
直接 I/O,实现直接 I/O
/proc 文件系统中的文件数,在 /proc 中实现文件
中断处理程序,实现处理程序–禁用所有中断
ioctl 命令,ioctl 命令的实现
ISA (PCI),回顾:ISA–即插即用规范
llseek 方法,寻找设备
mmap,Linux 中的内存管理–高内存和低内存,实现 mmap
多播,典型实现
策略,设备驱动程序的角色
可移动媒体(支持),支持可移动媒体
信号量,Linux 信号量实现–读取器/写入器信号量
定时器,内核定时器的实现
asynchronous I/O, Asynchronous I/O
busy-waiting, Busy waiting
of classes, Classes of Devices and Modules
of debugging levels, Turning the Messages On and Off
direct I/O, Implementing Direct I/O
of files in /proc filesystems, Implementing files in /proc
interrupt handlers, Implementing a Handler–Disabling all interrupts
ioctl commands, The Implementation of the ioctl Commands
ISA (PCI), A Look Back: ISA–The Plug-and-Play Specification
llseek method, Seeking a Device
mmap, Memory Management in Linux–High and Low Memory, Implementing mmap
multicasting, A Typical Implementation
of policies, The Role of the Device Driver
removable media (supporting), Supporting Removable Media
semaphores, The Linux Semaphore Implementation–Reader/Writer Semaphores
timers, The Implementation of Kernel Timers
inb 函数,操作 I/O 端口
inb function, Manipulating I/O ports
inb_p 函数,暂停 I/O
inb_p function, Pausing I/O
无限循环,防止系统挂起
infinite loops, preventing, System Hangs
信息泄露、安全问题
information leakage, Security Issues
初始化脚本和加载/卸载模块,主要号码的动态分配
init scripts and loading/unloading modules, Dynamic Allocation of Major Numbers
init.h 头文件,快速参考
init.h header file, Quick Reference
初始化、初始化和关闭–模块加载竞赛、Char 设备注册–旧方法、Linux 信号量实现、读/写器信号量、完成、seqlocks、安装中断处理程序、PCI 寻址、配置寄存器和初始化、注册 USB 驱动程序、Kobject 初始化、gendisk 结构、sbull 中的初始化、设备注册、初始化每个设备
initialization, Initialization and Shutdown–Module-Loading Races, Char Device Registration–The Older Way, The Linux Semaphore Implementation, Reader/Writer Semaphores, Completions, seqlocks, Installing an Interrupt Handler, PCI Addressing, Configuration Registers and Initialization, Registering a USB Driver, Kobject initialization, The gendisk structure, Initialization in sbull, Device Registration, Initializing Each Device
完成(信号量),完成
设备,初始化每个设备
gendisk 结构, gendisk 结构
中断处理程序,安装中断处理程序
kobjects,Kobject初始化
模块、初始化和关闭–模块加载竞赛
互斥体,Linux 信号量实现
net_device结构体,设备注册
PCI、PCI 寻址
读取器/写入器信号量、读取器/写入器信号量
寄存器 (PCI)、配置寄存器和初始化
sbull 驱动程序,sbull 中的初始化
序列锁,序列锁
struct usb_driver结构体,注册USB驱动程序
结构(注册),Char 设备注册–旧方法
completions (semaphores), Completions
devices, Initializing Each Device
gendisk structure, The gendisk structure
interrupt handlers, Installing an Interrupt Handler
kobjects, Kobject initialization
modules, Initialization and Shutdown–Module-Loading Races
mutexes, The Linux Semaphore Implementation
net_device structure, Device Registration
PCI, PCI Addressing
reader/writer semaphores, Reader/Writer Semaphores
registers (PCI), Configuration Registers and Initialization
sbull drivers, Initialization in sbull
seqlocks, seqlocks
struct usb_driver structure, Registering a USB Driver
structures (registration), Char Device Registration–The Older Way
INIT_LIST_HEAD 宏,链接列表
INIT_LIST_HEAD macro, Linked Lists
inl 函数,操作 I/O 端口
inl function, Manipulating I/O ports
内联汇编代码(示例),处理器特定寄存器
inline assembly code (example), Processor-Specific Registers
ioctl方法中的inode指针,ioctl
inode pointer in ioctl method, ioctl
索引节点结构,索引节点结构
inode structure, The inode Structure
输入设备(热插拔)、输入
input devices (hotplugging), Input
输入文件,启用异步通知,异步通知
input files, enabling asynchronous notification from, Asynchronous Notification
输入模块,内核符号表
input module, The Kernel Symbol Table
输入引脚、与硬件通信、I/O 端口示例、示例驱动程序
input pins, Communicating with Hardware, An I/O Port Example, A Sample Driver
从并行端口读取值,示例驱动程序
reading values from parallel port, A Sample Driver
insb 函数,字符串操作
insb function, String Operations
insl 函数,字符串操作
insl function, String Operations
insmod 程序、可加载模块、可加载模块、Hello World 模块、Hello World 模块、加载和卸载模块、内核符号表、模块参数、主编号动态分配
insmod program, Loadable Modules, Loadable Modules, The Hello World Module, The Hello World Module, Loading and Unloading Modules, The Kernel Symbol Table, Module Parameters, Dynamic Allocation of Major Numbers
分配参数值,模块参数
动态分配主号码,主号码的动态分配
modprobe 程序(对比),内核符号表
测试模块,Hello World 模块
assigning parameter values, Module Parameters
dynamically allocating major numbers, Dynamic Allocation of Major Numbers
modprobe program versus, The Kernel Symbol Table
testing modules using, The Hello World Module
安装、设置测试系统、安装中断处理程序–x86 上中断处理的内部结构、安装共享处理程序
installation, Setting Up Your Test System, Installing an Interrupt Handler–The internals of interrupt handling on the x86, Installing a Shared Handler
中断处理程序、安装中断处理程序–x86 上中断处理的内部结构、安装共享处理程序
主线内核,设置您的测试系统
interrupt handlers, Installing an Interrupt Handler–The internals of interrupt handling on the x86, Installing a Shared Handler
mainline kernels, Setting Up Your Test System
insw 函数,字符串操作
insw function, String Operations
int actual_length 字段 (USB),结构 urb
int actual_length field (USB), struct urb
int 数据类型,标准 C 类型的使用
int data type, Use of Standard C Types
int error_count 字段(USB),结构 urb
int error_count field (USB), struct urb
int 字段,注册 PCI 驱动程序,全局信息
int field, Registering a PCI Driver, Global Information
net_device结构体,全局信息
PCI 注册,注册 PCI 驱动程序
net_device structure, Global Information
PCI registration, Registering a PCI Driver
int flags 字段 (gendisk),gendisk 结构
int flags field (gendisk), The gendisk structure
int 函数 (USB),注册 USB 驱动程序
int function (USB), Registering a USB Driver
int interval 字段 (USB),结构 urb
int interval field (USB), struct urb
int major 字段 (gendisk),gendisk 结构
int major field (gendisk), The gendisk structure
int minor 字段 (USB),接口
int minor field (USB), Interfaces
int minors 字段 (gendisk),gendisk 结构
int minors field (gendisk), The gendisk structure
int minor_base 变量 (USB),探测和断开详细信息
int minor_base variable (USB), probe and disconnect in Detail
int number_of_packets 字段(USB),结构 urb
int number_of_packets field (USB), struct urb
int pci_enable_device函数,启用PCI设备
int pci_enable_device function, Enabling the PCI Device
int seq_escape 函数,seq_file 接口
int seq_escape function, The seq_file interface
int seq_path 函数,seq_file 接口
int seq_path function, The seq_file interface
int seq_printf 函数,seq_file 接口
int seq_printf function, The seq_file interface
int seq_putc 函数,seq_file 接口
int seq_putc function, The seq_file interface
int seq_puts 函数,seq_file 接口
int seq_puts function, The seq_file interface
int start_frame 字段(USB),结构 urb
int start_frame field (USB), struct urb
int status 字段 (USB),结构 urb
int status field (USB), struct urb
int transfer_buffer_length 字段 (USB),结构 urb
int transfer_buffer_length field (USB), struct urb
INTERFACE 变量,USB
INTERFACE variable, USB
接口特定数据类型,接口特定类型
interface-specific data types, Interface-Specific Types
接口、设备和模块类、版本依赖性、清理函数、旧方法、旧接口、seq_file 接口–seq_file 接口、读取器/写入器信号量、Spinlock API 简介、定时器 API、alloc_pages 接口、准备并行端口、接口特定类型、PCI 接口–硬件抽象、VLB、接口、配置、kset 上的操作、class_simple 接口、完整类接口、类接口、内核固件接口、注册–关于扇区大小的注释、块设备操作–ioctl 方法、请求处理–不使用请求队列进行操作、命令前期准备、标记命令队列–标记命令队列、快速参考–快速参考、snull 是如何设计的–数据包的物理传输、snull 的设计方式、接口信息、接口信息、媒体独立接口支持
interfaces, Classes of Devices and Modules, Version Dependency, The Cleanup Function, The Older Way, An older interface, The seq_file interface–The seq_file interface, Reader/Writer Semaphores, Introduction to the Spinlock API, The Timer API, The alloc_pages Interface, Preparing the Parallel Port, Interface-Specific Types, The PCI Interface–Hardware Abstractions, VLB, Interfaces, Configurations, Operations on ksets, The class_simple Interface, The Full Class Interface, Class interfaces, The Kernel Firmware Interface, Registration–A Note on Sector Sizes, The Block Device Operations–The ioctl Method, Request Processing–Doing without a request queue, Command Pre-Preparation, Tagged Command Queueing–Tagged Command Queueing, Quick Reference–Quick Reference, How snull Is Designed–The Physical Transport of Packets, How snull Is Designed, Interface Information, Interface Information, Media Independent Interface Support
alloc_pages,alloc_pages 接口
块驱动程序、注册–关于扇区大小的说明、块设备操作–ioctl 方法、请求处理–不使用请求队列、命令预先准备、标记命令队列–标记命令队列、快速参考–快速参考
命令预先准备,命令预先准备
功能,快速参考–快速参考
操作,块设备操作–ioctl 方法
注册,注册–关于扇区大小的说明
请求处理,请求处理–不使用请求队列
TCQ,标记命令队列–标记命令队列
类、类接口
class_simple,class_simple 接口
清理函数,清理函数
配置(USB),配置
固件,内核固件接口
net_device 结构的标志,接口信息
全类,全类接口
接口特定数据类型,接口特定类型
ksets,ksets 上的操作
环回,snull 是如何设计的
MII,媒体独立接口支持
网络、设备类别和模块
非以太网,接口信息
旧的,旧方法,旧接口
字符设备注册,旧方法
/proc 文件实现,旧接口
并行端口,准备并行端口(请参阅并行端口)
PCI,PCI 接口–硬件抽象
读取器/写入器信号量、读取器/写入器信号量
seq_file,seq_file 接口–seq_file 接口
snull,snull 是如何设计的–数据包的物理传输
自旋锁,自旋锁 API 简介
定时器,定时器 API
USB、接口
版本依赖,版本依赖
VLB, VLB
alloc_pages, The alloc_pages Interface
block drivers, Registration–A Note on Sector Sizes, The Block Device Operations–The ioctl Method, Request Processing–Doing without a request queue, Command Pre-Preparation, Tagged Command Queueing–Tagged Command Queueing, Quick Reference–Quick Reference
command pre-preparation, Command Pre-Preparation
functions, Quick Reference–Quick Reference
operations, The Block Device Operations–The ioctl Method
registration, Registration–A Note on Sector Sizes
request processing, Request Processing–Doing without a request queue
TCQ, Tagged Command Queueing–Tagged Command Queueing
classes, Class interfaces
class_simple, The class_simple Interface
cleanup function, The Cleanup Function
configuration (USB), Configurations
firmware, The Kernel Firmware Interface
flags for net_device structure, Interface Information
full class, The Full Class Interface
interface-specific data types, Interface-Specific Types
ksets, Operations on ksets
loopback, How snull Is Designed
MII, Media Independent Interface Support
networks, Classes of Devices and Modules
non-Ethernet, Interface Information
older, The Older Way, An older interface
char device registration, The Older Way
/proc file implementation, An older interface
parallel ports, Preparing the Parallel Port (see parallel ports)
PCI, The PCI Interface–Hardware Abstractions
reader/writer semaphores, Reader/Writer Semaphores
seq_file, The seq_file interface–The seq_file interface
snull, How snull Is Designed–The Physical Transport of Packets
spinlocks, Introduction to the Spinlock API
timers, The Timer API
USB, Interfaces
version dependency, Version Dependency
VLB, VLB
内部函数(锁定),不明确的规则
internal functions (locking), Ambiguous Rules
设备编号的内部表示,设备编号的内部表示
internal representation of device numbers, The Internal Representation of Device Numbers
互联网协议 (IP)、网络驱动程序
Internet protocol (IP), Network Drivers
中断处理程序、自动检测 IRQ 号、/proc 接口和共享中断
interrupt handlers, Autodetecting the IRQ Number, The /proc Interface and Shared Interrupts
自动检测 IRQ 编号,自动检测 IRQ 编号
共享中断、/proc 接口和共享中断
autodetecting IRQ numbers, Autodetecting the IRQ Number
sharing interrupts, The /proc Interface and Shared Interrupts
中断模式、内核定时器、Tasklet–Tasklet
interrupt mode, Kernel Timers, Tasklets–Tasklets
和异步执行,内核定时器
小任务,小任务–小任务
and asynchronous execution, Kernel Timers
tasklets, Tasklets–Tasklets
可中断睡眠、手动睡眠
interruptible sleeps, Manual sleeps
中断、测量时间间隔、准备并行端口、安装中断处理程序–x86 上中断处理的内部结构、安装中断处理程序、安装中断处理程序、/proc 接口、/proc 接口、实现处理程序–禁用所有中断、上半部和下半部–工作队列、Tasklet、中断共享–/proc 接口和共享中断、/proc 接口和共享中断、中断驱动 I/O–写缓冲示例、快速参考、快速参考、PCI 中断、中断 urbs、中断处理程序、中断处理程序、接收中断缓解、数据流、ioctls
interrupts, Measuring Time Lapses, Preparing the Parallel Port, Installing an Interrupt Handler–The internals of interrupt handling on the x86, Installing an Interrupt Handler, Installing an Interrupt Handler, The /proc Interface, The /proc Interface, Implementing a Handler–Disabling all interrupts, Top and Bottom Halves–Workqueues, Tasklets, Interrupt Sharing–The /proc Interface and Shared Interrupts, The /proc Interface and Shared Interrupts, Interrupt-Driven I/O–A Write-Buffering Example, Quick Reference, Quick Reference, PCI Interrupts, Interrupt urbs, The Interrupt Handler, The Interrupt Handler, Receive Interrupt Mitigation, Flow of Data, ioctls
计数、ioctls
文件,/proc 接口
处理程序、准备并行端口、安装中断处理程序–x86 上中断处理的内部结构、/proc 接口、实现处理程序–禁用所有中断、上半部分和下半部分–工作队列、Tasklet、中断共享–/proc 接口和共享中断、中断驱动 I/O–写缓冲示例、快速参考、快速参考、中断处理程序
I/O,中断驱动 I/O–写缓冲示例
实现,实现处理程序–禁用所有中断
安装,安装中断处理程序–x86 上中断处理的内部结构
管理,快速参考
对于网络驱动程序,中断处理程序
准备并行端口,准备并行端口
/proc 文件,/proc 接口
注册,快速参考
共享,中断共享–/proc 接口和共享中断
小任务,小任务
上半部和下半部,上半部和下半部–工作队列
安装位于,安装中断处理程序
缓解,接收中断缓解
对于网络驱动程序,中断处理程序
PCI,PCI 中断
报告,安装中断处理程序
共享中断,/proc 接口和共享中断
计时器,测量时间流逝
tty 驱动程序、数据流
urbs,中断 urbs
counts, ioctls
file, The /proc Interface
handlers, Preparing the Parallel Port, Installing an Interrupt Handler–The internals of interrupt handling on the x86, The /proc Interface, Implementing a Handler–Disabling all interrupts, Top and Bottom Halves–Workqueues, Tasklets, Interrupt Sharing–The /proc Interface and Shared Interrupts, Interrupt-Driven I/O–A Write-Buffering Example, Quick Reference, Quick Reference, The Interrupt Handler
I/O, Interrupt-Driven I/O–A Write-Buffering Example
implementation of, Implementing a Handler–Disabling all interrupts
installation of, Installing an Interrupt Handler–The internals of interrupt handling on the x86
management, Quick Reference
for network drivers, The Interrupt Handler
preparing parallel ports, Preparing the Parallel Port
/proc files for, The /proc Interface
registration, Quick Reference
sharing, Interrupt Sharing–The /proc Interface and Shared Interrupts
tasklets, Tasklets
top and bottom halves, Top and Bottom Halves–Workqueues
installation at, Installing an Interrupt Handler
mitigation of, Receive Interrupt Mitigation
for network drivers, The Interrupt Handler
PCI, PCI Interrupts
reports, Installing an Interrupt Handler
shared interrupts and, The /proc Interface and Shared Interrupts
timers, Measuring Time Lapses
tty drivers, Flow of Data
urbs, Interrupt urbs
时间间隔(数据类型可移植性),时间间隔
intervals of time (data type portability), Time Intervals
intptr_t 类型(C99 标准),标准 C 类型的使用
intptr_t type (C99 standard), Use of Standard C Types
inw 函数,操作 I/O 端口
inw function, Manipulating I/O ports
in_atomic 函数,内核定时器
in_atomic function, Kernel Timers
in_interrupt 函数,内核定时器
in_interrupt function, Kernel Timers
ioctl 命令(创建),快速参考
ioctl commands (creating), Quick Reference
ioctl 函数(tty 驱动程序), ioctl
ioctl function (tty drivers), ioctls
ioctl 方法、文件操作、重定向控制台消息、ioctl 方法、ioctl–不使用 ioctl 的设备控制、选择 ioctl 命令、不使用 ioctl 的设备控制、ioctl 方法、设备方法、自定义 ioctl 命令
ioctl method, File Operations, Redirecting Console Messages, The ioctl Method, ioctl–Device Control Without ioctl, Choosing the ioctl Commands, Device Control Without ioctl, The ioctl Method, The Device Methods, Custom ioctl Commands
使用位域定义命令,选择 ioctl 命令
块驱动程序,ioctl 方法
不使用 ioctl 控制设备,不使用 ioctl 的设备控制
自定义网络、自定义 ioctl 命令
调试,ioctl 方法
网络设备和设备方法
TIOCLINUX 命令,重定向控制台消息
using bitfields to define commands, Choosing the ioctl Commands
block drivers, The ioctl Method
controlling devices without, Device Control Without ioctl
customizing for networking, Custom ioctl Commands
debugging with, The ioctl Method
network devices and, The Device Methods
TIOCLINUX command, Redirecting Console Messages
ioctl-number.txt 文件,选择 ioctl 命令
ioctl-number.txt file, Choosing the ioctl Commands
ioctl.h 头文件、选择 ioctl 命令、选择 ioctl 命令、快速参考
ioctl.h header file, Choosing the ioctl Commands, Choosing the ioctl Commands, Quick Reference
设置命令编号,选择 ioctl 命令
setting up command numbers, Choosing the ioctl Commands
_IOC_NRBITS 宏,快速参考,快速参考,快速参考,快速参考
_IOC_NRBITS macro, Quick Reference, Quick Reference, Quick Reference, Quick Reference
IOMMU(I/O 内存管理单元)、地址类型、DMA 映射
IOMMU (I/O memory management unit), Address Types, DMA mappings
ioremap、vmalloc 和朋友
ioremap, vmalloc and Friends
ioremap 函数、vmalloc 和朋友、使用 I/O 内存、快速参考
ioremap function, vmalloc and Friends, Using I/O Memory, Quick Reference
ioremap_nocache函数,I/O内存分配和映射
ioremap_nocache function, I/O Memory Allocation and Mapping
iounmap 函数、vmalloc 和朋友、I/O 内存分配和映射
iounmap function, vmalloc and Friends, I/O Memory Allocation and Mapping
IP(互联网协议)、网络驱动程序
IP (Internet protocol), Network Drivers
IP 号解析为物理地址, MAC 地址解析
IP numbers resolving to physical addresses, MAC Address Resolution
ip_summed 字段 (sk_buff),数据包接收,重要字段
ip_summed field (sk_buff), Packet Reception, The Important Fields
irq 参数(中断号),安装中断处理程序
irq argument (interrupt number), Installing an Interrupt Handler
irq.h 头文件,DIY 探测
irq.h header file, Do-it-yourself probing
irqreturn_t 函数,安装中断处理程序
irqreturn_t function, Installing an Interrupt Handler
IRQ(中断请求线)、/proc 接口、自动检测 IRQ 编号
IRQs (interrupt request lines), The /proc Interface, Autodetecting the IRQ Number
自动检测,自动检测IRQ号
统计信息,/proc 接口
autodetecting, Autodetecting the IRQ Number
statistics on, The /proc Interface
ISA、暂停 I/O、ISA 内存低于 1 MB、ISA 内存低于 1 MB、回顾:ISA–即插即用规范、ISA 设备的 DMA–与 DMA 控制器对话、ISA 设备的 DMA、ISA 设备的 DMA–与 DMA 控制器对话
ISA, Pausing I/O, ISA Memory Below 1 MB, ISA Memory Below 1 MB, A Look Back: ISA–The Plug-and-Play Specification, DMA for ISA Devices–Talking to the DMA controller, DMA for ISA Devices, DMA for ISA Devices–Talking to the DMA controller
总线主控 DMA、ISA 设备的 DMA
设备,DMA,ISA 设备的 DMA–与 DMA 控制器对话
I/O(暂停设备)、暂停 I/O
内存(访问)、ISA 内存低于 1 MB、ISA 内存低于 1 MB、ISA 设备的 DMA–与 DMA 控制器对话
低于 1 MB,ISA 内存低于 1 MB
DMA,ISA 设备的 DMA–与 DMA 控制器对话
PCI,回顾:ISA–即插即用规范
bus master DMA, DMA for ISA Devices
devices, DMA for, DMA for ISA Devices–Talking to the DMA controller
I/O (pausing devices), Pausing I/O
memory (access), ISA Memory Below 1 MB, ISA Memory Below 1 MB, DMA for ISA Devices–Talking to the DMA controller
below 1 MB, ISA Memory Below 1 MB
DMA for, DMA for ISA Devices–Talking to the DMA controller
PCI, A Look Back: ISA–The Plug-and-Play Specification
isa_readb 函数、isa_readb 和朋友
isa_readb function, isa_readb and Friends
等时端点 (USB)、端点
ISOCHRONOUS endpoints (USB), Endpoints
等时 urb (USB)、等时 urb
isochronous urbs (USB), Isochronous urbs
总线迭代、设备和驱动程序迭代
iteration of buses, Iterating over devices and drivers

J

J

jiffies,测量时间间隔,测量时间间隔,忙等待,短延迟,实用程序字段
jiffies, Measuring Time Lapses, Measuring Time Lapses, Busy waiting, Short Delays, Utility Fields
在忙等待实现中,忙等待
计数器,测量时间流逝
短延迟没有解决方案,短延迟
值,测量时间流逝,实用程序字段
in busy-waiting implementation, Busy waiting
counters, Measuring Time Lapses
no solution for short delays, Short Delays
values, Measuring Time Lapses, Utility Fields
jit(及时)模块,了解当前时间,忙等待
jit (just in time) module, Knowing the Current Time, Busy waiting
当前时间(检索),了解当前时间
延迟代码执行,忙等待
current time (retrieving), Knowing the Current Time
delaying code execution, Busy waiting
jitbusy 程序,忙等待
jitbusy program, Busy waiting
操纵杆(热插拔)、输入
joysticks (hotplugging), Input

K

K

kcore 文件,使用 gdb
kcore file, Using gdb
kdataalign程序,数据对齐
kdataalign program, Data Alignment
kdatasize 模块,标准 C 类型的使用
kdatasize module, Use of Standard C Types
内核辅助探测,内核辅助探测
kernel-assisted probing, Kernel-assisted probing
内核、设备驱动程序简介、拆分内核、设备和模块类、安全问题、版本编号–版本编号、版本编号、加入内核开发社区、设置您的测试系统、设置您的测试系统、Hello World 模块、内核模块与应用程序–其他一些细节、内核模块与应用程序、用户空间和内核空间、用户空间和内核空间、内核中的并发性、当前进程、加载和卸载模块–平台依赖性、加载和卸载模块、版本依赖性、平台依赖性、内核符号表–内核符号表、预备知识、模块加载竞赛、一些重要的数据结构、inode 结构、读写、内核中的调试支持–内核中的调试支持、通过打印进行调试–打印设备编号、通过查询进行调试–ioctl 方法、通过观察进行调试、系统挂起、调试器和相关工具–动态探针、kgdb 补丁、并发及其管理–并发及其管理、信号量和互斥体、Linux 信号量实现–读取器/写入器信号量、完成–完成、锁定陷阱–细粒度锁定与粗粒度锁定、锁定的替代方案–读取-复制-更新、功能和受限操作、独占等待、测量时间间隔–处理器特定的寄存器、了解当前时间–了解当前时间、内核计时器–内核计时器的实现、Tasklet–Tasklet、工作队列–共享队列、计时、内核计时器、Tasklet、工作队列、安装中断处理程序–x86 上中断处理的内部结构、实现处理程序–禁用所有中断、使用标准 C 类型、为数据项分配显式大小、接口特定类型、其他可移植性问题–指针和错误值、链接列表–链接列表、USB 和 Sysfs–USB 和 Sysfs、USB Urbs–取消 Urbs、编写 USB 驱动程序–提交和控制 Urb、无 Urbs 的 USB 传输–其他 USB 数据功能、Linux 设备模型–Linux 设备模型、Kobject、Kset 和子系统–子系统、低级 Sysfs 操作–符号链接、热插拔事件生成、总线–总线属性、设备–驱动程序结构嵌入、类–类接口、将它们放在一起–删除驱动程序、热插拔–udev、处理固件–工作原理、地址类型、地址类型、虚拟内存区域–vm_area_struct 结构、重新映射内核虚拟地址、连接到内核–实用程序字段、内核对多播的支持
kernels, An Introduction to Device Drivers, Splitting the Kernel, Classes of Devices and Modules, Security Issues, Version Numbering–Version Numbering, Version Numbering, Joining the Kernel Development Community, Setting Up Your Test System, Setting Up Your Test System, The Hello World Module, Kernel Modules Versus Applications–A Few Other Details, Kernel Modules Versus Applications, User Space and Kernel Space, User Space and Kernel Space, Concurrency in the Kernel, The Current Process, Loading and Unloading Modules–Platform Dependency, Loading and Unloading Modules, Version Dependency, Platform Dependency, The Kernel Symbol Table–The Kernel Symbol Table, Preliminaries, Module-Loading Races, Some Important Data Structures, The inode Structure, read and write, Debugging Support in the Kernel–Debugging Support in the Kernel, Debugging by Printing–Printing Device Numbers, Debugging by Querying–The ioctl Method, Debugging by Watching, System Hangs, Debuggers and Related Tools–Dynamic Probes, The kgdb Patches, Concurrency and Its Management–Concurrency and Its Management, Semaphores and Mutexes, The Linux Semaphore Implementation–Reader/Writer Semaphores, Completions–Completions, Locking Traps–Fine- Versus Coarse-Grained Locking, Alternatives to Locking–Read-Copy-Update, Capabilities and Restricted Operations, Exclusive waits, Measuring Time Lapses–Processor-Specific Registers, Knowing the Current Time–Knowing the Current Time, Kernel Timers–The Implementation of Kernel Timers, Tasklets–Tasklets, Workqueues–The Shared Queue, Timekeeping, Kernel Timers, Tasklets, Workqueues, Installing an Interrupt Handler–The internals of interrupt handling on the x86, Implementing a Handler–Disabling all interrupts, Use of Standard C Types, Assigning an Explicit Size to Data Items, Interface-Specific Types, Other Portability Issues–Pointers and Error Values, Linked Lists–Linked Lists, USB and Sysfs–USB and Sysfs, USB Urbs–Canceling Urbs, Writing a USB Driver–Submitting and Controlling a Urb, USB Transfers Without Urbs–Other USB Data Functions, The Linux Device Model–The Linux Device Model, Kobjects, Ksets, and Subsystems–Subsystems, Low-Level Sysfs Operations–Symbolic Links, Hotplug Event Generation, Buses–Bus attributes, Devices–Driver structure embedding, Classes–Class interfaces, Putting It All Together–Remove a Driver, Hotplug–udev, Dealing with Firmware–How It Works, Address Types, Address Types, Virtual Memory Areas–The vm_area_struct structure, Remapping Kernel Virtual Addresses, Connecting to the Kernel–Utility Fields, Kernel Support for Multicasting
(另见模块)
应用程序(比较),内核模块与应用程序-一些其他细节
能力和受限操作,能力和受限操作
代码要求,预备知识
并发性、内核中的并发性、并发性及其管理–并发性及其管理、信号量和互斥体、Linux 信号量实现–读取器/写入器信号量、完成–完成、锁定陷阱–细粒度锁定与粗粒度锁定、锁定的替代方案–读取-复制-更新
添加锁定、信号量和互斥体
锁定的替代方案,锁定的替代方案–读取-复制-更新
锁定陷阱,锁定陷阱–细粒度锁定与粗粒度锁定
管理,并发及其管理–并发及其管理
信号量完成,完成–完成
信号量实现,Linux 信号量实现–读取器/写入器信号量
当前进程,当前进程
数据结构,一些重要的数据结构
中的数据类型、标准 C 类型的使用为数据项分配显式大小接口特定类型其他可移植性问题指针和错误值链接列表链接列表
分配显式大小,为数据项分配显式大小
接口特定的、接口特定的类型
链接列表,链接列表链接列表
可移植性、其他可移植性问题指针和错误值
标准 C 类型、标准 C 类型的使用
调试器、调试器和相关工具动态探针
开发社区,加入,加入内核开发社区
开发(实验)、版本编号
独家等待,独家等待
文件系统模块、设备和模块类
标头、内核模块与应用程序
索引节点结构,索引节点结构
中断,安装中断处理程序x86 上中断处理的内部结构实现处理程序禁用所有中断
实现处理程序,实现处理程序禁用所有中断
安装处理程序、安装中断处理程序x86 上中断处理的内部结构
简介,设备驱动程序简介
kgdb 补丁和kgdb 补丁
Linux 设备模型、Linux 设备模型Linux 设备模型Kobject、Kset 和子系统子系统低级 Sysfs 操作符号链接热插拔事件生成总线总线属性设备驱动程序结构嵌入类接口将它们放在一起删除驱动程序热插拔udev处理固件工作原理
总线、总线总线属性
类,类类接口
设备,设备驱动程序结构嵌入
固件,处理固件它是如何工作的
热插拔、热插拔事件生成热插拔udev
kobjects、Kobjects、Ksets 和子系统子系统
生命周期,将它们放在一起删除驱动程序
低级 sysfs 操作,低级 Sysfs 操作符号链接
逻辑地址、地址类型
主线(安装),设置您的测试系统
消息,Hello World 模块
模块,加载和卸载模块平台依赖性加载和卸载模块
加载、加载和卸载模块——平台依赖
卸载、加载和卸载模块
监控、观察调试
多播支持,多播内核支持
网络驱动程序连接,连接到内核-实用程序字段
平台依赖性,平台依赖性
打印、通过打印进行调试打印设备编号
查询、通过查询进行调试ioctl 方法
安全,安全问题
空间、用户空间和内核空间用户空间和内核空间读写
支持,内核中的调试支持内核中的调试支持
符号,内核符号表内核符号表
系统挂起,系统挂起
小任务,小任务小任务小任务
测试系统设置,设置您的测试系统
时间、测量时间流逝处理器特定的寄存器了解当前时间了解当前时间计时
延时测量,测量延时-处理器特定寄存器
检索当前时间,了解当前时间了解当前时间
定时器,内核定时器内核定时器的实现内核定时器
USB、USB 和 SysfsUSB 和 SysfsUSB Urbs取消 Urbs编写 USB 驱动程序提交和控制 Urb没有 Urbs 的 USB 传输其他 USB 数据功能
sysfs 目录树、USB 和 SysfsUSB 和 Sysfs
无 urbs 传输、无 Urbs 的 USB 传输其他 USB 数据功能
urbs、USB Urbs取消 Urbs
编写,编写 USB 驱动程序提交和控制 Urb
版本、版本编号版本编号版本依赖性
依赖关系、版本依赖关系
编号、版本编号版本编号
查看、分割内核
虚拟地址、地址类型重新映射内核虚拟地址
VMA,虚拟内存区域vm_area_struct 结构
workqueues, Workqueues共享队列, Workqueues
(see also modules)
applications (comparisons to), Kernel Modules Versus ApplicationsA Few Other Details
capabilities and restricted operations, Capabilities and Restricted Operations
code requirements, Preliminaries
concurrency, Concurrency in the Kernel, Concurrency and Its ManagementConcurrency and Its Management, Semaphores and Mutexes, The Linux Semaphore ImplementationReader/Writer Semaphores, CompletionsCompletions, Locking TrapsFine- Versus Coarse-Grained Locking, Alternatives to LockingRead-Copy-Update
adding locking, Semaphores and Mutexes
alternatives to locking, Alternatives to LockingRead-Copy-Update
locking traps, Locking TrapsFine- Versus Coarse-Grained Locking
management of, Concurrency and Its ManagementConcurrency and Its Management
semaphore completion, CompletionsCompletions
semaphore implementation, The Linux Semaphore ImplementationReader/Writer Semaphores
current process and, The Current Process
data structures, Some Important Data Structures
data types in, Use of Standard C Types, Assigning an Explicit Size to Data Items, Interface-Specific Types, Other Portability IssuesPointers and Error Values, Linked ListsLinked Lists
assigning explicit sizes to, Assigning an Explicit Size to Data Items
interface-specific, Interface-Specific Types
linked lists, Linked ListsLinked Lists
portability, Other Portability IssuesPointers and Error Values
standard C types, Use of Standard C Types
debuggers, Debuggers and Related ToolsDynamic Probes
development community, joining, Joining the Kernel Development Community
developmental (experimental), Version Numbering
exclusive waits, Exclusive waits
filesystem modules, Classes of Devices and Modules
headers, Kernel Modules Versus Applications
inode structure, The inode Structure
interrupts, Installing an Interrupt HandlerThe internals of interrupt handling on the x86, Implementing a HandlerDisabling all interrupts
implementing handlers, Implementing a HandlerDisabling all interrupts
installing handlers, Installing an Interrupt HandlerThe internals of interrupt handling on the x86
introduction to, An Introduction to Device Drivers
kgdb patch and, The kgdb Patches
Linux device model, The Linux Device ModelThe Linux Device Model, Kobjects, Ksets, and SubsystemsSubsystems, Low-Level Sysfs OperationsSymbolic Links, Hotplug Event Generation, BusesBus attributes, DevicesDriver structure embedding, ClassesClass interfaces, Putting It All TogetherRemove a Driver, Hotplugudev, Dealing with FirmwareHow It Works
buses, BusesBus attributes
classes, ClassesClass interfaces
devices, DevicesDriver structure embedding
firmware, Dealing with FirmwareHow It Works
hotplugging, Hotplug Event Generation, Hotplugudev
kobjects, Kobjects, Ksets, and SubsystemsSubsystems
lifecycles, Putting It All TogetherRemove a Driver
low-level sysfs operations, Low-Level Sysfs OperationsSymbolic Links
logical addresses, Address Types
mainline (installation of), Setting Up Your Test System
messages, The Hello World Module
modules, Loading and Unloading ModulesPlatform Dependency, Loading and Unloading Modules
loading, Loading and Unloading ModulesPlatform Dependency
unloading, Loading and Unloading Modules
monitoring, Debugging by Watching
multicasting support, Kernel Support for Multicasting
network driver connections, Connecting to the KernelUtility Fields
platform dependency, Platform Dependency
printing, Debugging by PrintingPrinting Device Numbers
querying, Debugging by QueryingThe ioctl Method
security, Security Issues
space, User Space and Kernel Space, User Space and Kernel Space, read and write
support, Debugging Support in the KernelDebugging Support in the Kernel
symbols, The Kernel Symbol TableThe Kernel Symbol Table
system hangs, System Hangs
tasklets, TaskletsTasklets, Tasklets
test system setup, Setting Up Your Test System
time, Measuring Time LapsesProcessor-Specific Registers, Knowing the Current TimeKnowing the Current Time, Timekeeping
measurement of lapses, Measuring Time LapsesProcessor-Specific Registers
retrieving current time, Knowing the Current TimeKnowing the Current Time
timers, Kernel TimersThe Implementation of Kernel Timers, Kernel Timers
USB, USB and SysfsUSB and Sysfs, USB UrbsCanceling Urbs, Writing a USB DriverSubmitting and Controlling a Urb, USB Transfers Without UrbsOther USB Data Functions
sysfs directory trees, USB and SysfsUSB and Sysfs
transfers without urbs, USB Transfers Without UrbsOther USB Data Functions
urbs, USB UrbsCanceling Urbs
writing, Writing a USB DriverSubmitting and Controlling a Urb
versions, Version NumberingVersion Numbering, Version Dependency
dependency, Version Dependency
numbering, Version NumberingVersion Numbering
viewing, Splitting the Kernel
virtual addresses, Address Types, Remapping Kernel Virtual Addresses
VMAs, Virtual Memory AreasThe vm_area_struct structure
workqueues, WorkqueuesThe Shared Queue, Workqueues
kernel_ulong_t driver_info 字段(USB),驱动程序支持哪些设备?
kernel_ulong_t driver_info field (USB), What Devices Does the Driver Support?
KERNEL_VERSION 宏,版本依赖
KERNEL_VERSION macro, Version Dependency
KERN_ALERT 宏、printk
KERN_ALERT macro, printk
KERN_CRIT 宏、printk
KERN_CRIT macro, printk
KERN_DEBUG 宏、printk
KERN_DEBUG macro, printk
KERN_EMERG 宏、printk
KERN_EMERG macro, printk
KERN_ERR 宏、printk
KERN_ERR macro, printk
KERN_INFO 宏、printk
KERN_INFO macro, printk
KERN_NOTICE 宏、printk
KERN_NOTICE macro, printk
KERN_WARNING 宏、printk
KERN_WARNING macro, printk
键盘、系统挂起输入
keyboards, System Hangs, Input
调试时锁定,系统挂起
热插拔、输入
debugging when locked, System Hangs
hotplugging, Input
键(神奇的 SysRq),系统挂起
keys (magic SysRq), System Hangs
kfree、scull 的内存使用情况
kfree, scull's Memory Usage
kfree_skb 函数,作用于套接字缓冲区的函数
kfree_skb function, Functions Acting on Socket Buffers
kgdb 补丁, kgdb 补丁
kgdb patch, The kgdb Patches
杀死 urbs、取消 urbs
killing urbs, Canceling Urbs
kill_fasync 函数,驱动程序的观点快速参考
kill_fasync function, The Driver's Point of View, Quick Reference
klogd 守护进程、Hello World 模块printk如何记录消息如何记录消息
klogd daemon, The Hello World Module, printk, How Messages Get Logged, How Messages Get Logged
记录消息、如何记录消息如何记录消息
logging messages, How Messages Get Logged, How Messages Get Logged
kmalloc、标志参数vmalloc 和朋友vmalloc 和朋友
kmalloc, The Flags Argument, vmalloc and Friends, vmalloc and Friends
标志参数,标志参数
返回虚拟地址、vmalloc 和朋友
与 vmalloc 相比,vmalloc 和朋友
flags argument, The Flags Argument
returning virtual addresses, vmalloc and Friends
versus vmalloc, vmalloc and Friends
kmalloc 函数、scull 的内存使用kmalloc 的真实故事大小参数get_free_page 和朋友
kmalloc function, scull's Memory Usage, The Real Story of kmallocThe Size Argument, get_free_page and Friends
分配引擎,kmalloc 的真实故事大小参数
性能下降问题、get_free_page 和朋友
allocation engine, The Real Story of kmallocThe Size Argument
performance degradation issues, get_free_page and Friends
kmap 函数,内存映射和结构页
kmap function, The Memory Map and Struct Page
kmap_skb_frag 函数,作用于套接字缓冲区的函数
kmap_skb_frag function, Functions Acting on Socket Buffers
kmem_cache_alloc 函数,后备缓存
kmem_cache_alloc function, Lookaside Caches
kmem_cache_create 函数,后备缓存
kmem_cache_create function, Lookaside Caches
kmem_cache_t 类型函数,后备缓存
kmem_cache_t type function, Lookaside Caches
kmsg 文件,如何记录消息
kmsg file, How Messages Get Logged
kobjects、Kobjects、Ksets 和子系统子系统发布函数和 kobject 类型低级 Sysfs 操作符号链接默认属性非默认属性符号链接热插拔事件生成
kobjects, Kobjects, Ksets, and SubsystemsSubsystems, Release functions and kobject types, Low-Level Sysfs OperationsSymbolic Links, Default Attributes, Nondefault Attributes, Symbolic Links, Hotplug Event Generation
热插拔事件生成,热插拔事件生成
低级 sysfs 操作,低级 Sysfs 操作符号链接
非默认属性,非默认属性
释放函数、Release 函数和 kobject 类型
存储方法,默认属性
符号链接,符号链接
hotplug event generation, Hotplug Event Generation
low-level sysfs operations, Low-Level Sysfs OperationsSymbolic Links
nondefault attributes, Nondefault Attributes
release functions, Release functions and kobject types
store method, Default Attributes
symbolic links, Symbolic Links
ksets、Kobject 层次结构、Ksets 和子系统ksets 操作子系统
ksets, Kobject Hierarchies, Ksets, and Subsystems, Operations on ksets, Subsystems
操作,对 kset 的操作
子系统,子系统
operations on, Operations on ksets
subsystems, Subsystems
kset_hotplug_ops 结构,热插拔操作
kset_hotplug_ops structure, Hotplug Operations
ksyms 文件,初始化和关闭
ksyms file, Initialization and Shutdown

L

L

时间流逝、测量、测量时间流逝处理器特定寄存器
lapses of time, measurement of, Measuring Time LapsesProcessor-Specific Registers
笔记本电脑扩展坞,笔记本电脑扩展坞
laptop docking stations, Laptop docking stations
大缓冲区、获取、获取大缓冲区快速参考
large buffers, obtaining, Obtaining Large Buffers, Quick Reference
大文件实现(/proc 文件),seq_file 接口
large file implementations (/proc files), The seq_file interface
层、内核符号表通用 DMA 层
layers, The Kernel Symbol Table, The Generic DMA Layer
通用 DMA,通用 DMA 层
模块化,内核符号表
generic DMA, The Generic DMA Layer
modularization, The Kernel Symbol Table
lddbus 驱动程序、总线方法
lddbus driver, Bus methods
ldd_driver结构体,Driver结构体嵌入
ldd_driver structure, Driver structure embedding
LED,焊接到输出引脚,示例驱动器
LEDs, soldering to output pins, A Sample Driver
级别、用户空间和内核空间打开和关闭消息打开和关闭消息
levels, User Space and Kernel Space, Turning the Messages On and Off, Turning the Messages On and Off
CPU(模式)、用户空间和内核空间
调试、打开和关闭消息打开和关闭消息
CPU (modalities), User Space and Kernel Space
debugging, Turning the Messages On and Off, Turning the Messages On and Off
库、内核模块与应用程序
libraries, Kernel Modules Versus Applications
许可条款,许可条款
license terms, License Terms
生命周期、USB UrbsLinux 设备模型综合起来删除驱动程序
lifecycles, USB Urbs, The Linux Device Model, Putting It All TogetherRemove a Driver
Linux 设备模型,将它们放在一起删除驱动程序
对象,Linux 设备模型
urbs, USB Urbs
Linux device model, Putting It All TogetherRemove a Driver
objects, The Linux Device Model
urbs, USB Urbs
调试消息的限制(printk 函数)、速率限制
limitations of debug messages (printk function), Rate Limiting
线路设置(tty 驱动程序)、TTY 线路设置
line settings (tty drivers), TTY Line Settings
线路状态寄存器 (LSR)、ioctls
line status register (LSR), ioctls
链接状态(变化),链接状态变化
link state (changes in), Changes in Link State
链表、链表
linked lists, Linked Lists
遍历,链表
traversal of, Linked Lists
链接库、内核模块与应用程序
linking libraries, Kernel Modules Versus Applications
链接(符号)、符号链接
links (symbolic), Symbolic Links
Linux、版本编号许可条款
Linux, Version Numbering, License Terms
许可条款,许可条款
版本编号,版本编号
license terms, License Terms
version numbering, Version Numbering
Linux 设备模型、Linux 设备模型Linux 设备模型Kobject、Kset 和子系统子系统低级 Sysfs 操作符号链接热插拔事件生成总线总线属性设备驱动程序结构嵌入类接口将它们放在一起删除驱动程序热插拔udev处理固件工作原理
Linux device model, The Linux Device ModelThe Linux Device Model, Kobjects, Ksets, and SubsystemsSubsystems, Low-Level Sysfs OperationsSymbolic Links, Hotplug Event Generation, BusesBus attributes, DevicesDriver structure embedding, ClassesClass interfaces, Putting It All TogetherRemove a Driver, Hotplugudev, Dealing with FirmwareHow It Works
总线、总线总线属性
类,类类接口
设备,设备驱动程序结构嵌入
固件,处理固件它是如何工作的
热插拔,热插拔udev
kobjects、Kobjects、Ksets 和子系统子系统低级 Sysfs 操作符号链接热插拔事件生成
热插拔事件,热插拔事件生成
低级 sysfs 操作,低级 Sysfs 操作符号链接
生命周期,将它们放在一起删除驱动程序
buses, BusesBus attributes
classes, ClassesClass interfaces
devices, DevicesDriver structure embedding
firmware, Dealing with FirmwareHow It Works
hotplugging, Hotplugudev
kobjects, Kobjects, Ksets, and SubsystemsSubsystems, Low-Level Sysfs OperationsSymbolic Links, Hotplug Event Generation
hotplug events, Hotplug Event Generation
low-level sysfs operations, Low-Level Sysfs OperationsSymbolic Links
lifecycles, Putting It All TogetherRemove a Driver
Linux 跟踪工具包 (LTT),Linux 跟踪工具包
Linux Trace Toolkit (LTT), The Linux Trace Toolkit
linux-kernel 邮件列表、加入内核开发社区链接列表
linux-kernel mailing list, Joining the Kernel Development Community, Linked Lists
LINUX_VERSION_CODE 宏、版本依赖版本依赖快速参考
LINUX_VERSION_CODE macro, Version Dependency, Version Dependency, Quick Reference
list.h 头文件,链接列表
list.h header file, Linked Lists
列表 (PCI),快速参考
lists (PCI), Quick Reference
list_add 函数,链接列表
list_add function, Linked Lists
list_add_tail 函数,链接列表
list_add_tail function, Linked Lists
list_del 函数,链表
list_del function, Linked Lists
list_empty 函数,链接列表
list_empty function, Linked Lists
list_entry 宏,链接列表
list_entry macro, Linked Lists
list_for_each 宏,链接列表
list_for_each macro, Linked Lists
list_head数据结构,链表
list_head data structure, Linked Lists
list_move 函数,链接列表
list_move function, Linked Lists
list_splice 函数,链表
list_splice function, Linked Lists
小端字节顺序,字节顺序
little-endian byte order, Byte Order
llseek 方法,文件操作,文件操作,寻找设备
llseek method, File Operations, File Operations, Seeking a Device
可加载模块,可加载模块
loadable modules, Loadable Modules
加载、加载和卸载模块平台依赖性模块加载竞赛模块参数模块参数主编号动态分配主编号动态分配工作原理
loading, Loading and Unloading ModulesPlatform Dependency, Module-Loading Races, Module ParametersModule Parameters, Dynamic Allocation of Major Numbers, Dynamic Allocation of Major Numbers, How It Works
属性(固件),工作原理
驱动程序,主要号码的动态分配
模块、加载和卸载模块平台依赖性模块加载竞赛模块参数模块参数主编号的动态分配
动态分配设备编号、动态分配主编号
参数,模块参数模块参数
竞赛、模块加载竞赛
attribute (firmware), How It Works
drivers, Dynamic Allocation of Major Numbers
modules, Loading and Unloading ModulesPlatform Dependency, Module-Loading Races, Module ParametersModule Parameters, Dynamic Allocation of Major Numbers
dynamically assigned device numbers, Dynamic Allocation of Major Numbers
parameters, Module ParametersModule Parameters
races, Module-Loading Races
local0(IP号码),分配IP号码
local0 (IP number), Assigning IP Numbers
LocalTalk 设备,设置接口信息字段
LocalTalk devices, setting up fields for, Interface Information
锁定方法,文件操作,文件操作
lock method, File Operations, File Operations
无锁算法,无锁算法
lock-free algorithms, Lock-Free Algorithms
键盘锁定(调试),系统挂起
locked keyboard (debugging), System Hangs
锁定、并发及其管理信号量和互斥体锁定陷阱细粒度锁定与粗粒度锁定锁定排序规则锁定替代方案读取-复制-更新原子变量seqlock
locking, Concurrency and Its Management, Semaphores and Mutexes, Locking TrapsFine- Versus Coarse-Grained Locking, Lock Ordering Rules, Alternatives to LockingRead-Copy-Update, Atomic Variables, seqlocks
添加、信号量和互斥体
锁定的替代方案–读取-复制-更新
原子变量,原子变量
规则,锁排序规则
序列锁,序列锁
陷阱、锁定陷阱细粒度锁定与粗粒度锁定
adding, Semaphores and Mutexes
alternatives to, Alternatives to LockingRead-Copy-Update
atomic variables, Atomic Variables
rules for, Lock Ordering Rules
seqlocks, seqlocks
traps, Locking TrapsFine- Versus Coarse-Grained Locking
Lockmeter 工具,细粒度锁定与粗粒度锁定
lockmeter tool, Fine- Versus Coarse-Grained Locking
loff_t(长偏移),文件操作文件结构
loff_t (long offset), File Operations, The file Structure
loff_t f_pos (struct file field),文件结构
loff_t f_pos (struct file field), The file Structure
记录消息(printk 函数),如何记录消息
logging messages (printk function), How Messages Get Logged
逻辑地址、地址类型
logical addresses, Address Types
逻辑单元 (USB)、配置
logical units (USB), Configurations
登录过程,设备文件的访问控制
login process, Access Control on a Device File
loglevels、Hello World 模块printk
loglevels, The Hello World Module, printk
消息优先级,Hello World 模块
message priorities, The Hello World Module
LOG_BUF_LEN 循环缓冲区,如何记录消息
LOG_BUF_LEN circular buffer, How Messages Get Logged
长数据类型,标准 C 类型的使用
long data type, Use of Standard C Types
(代码执行的)长延迟,长延迟
long delays (of code execution), Long Delays
Lookaside 缓存,Lookaside 缓存alloc_pages 接口快速参考
lookaside caches, Lookaside CachesThe alloc_pages Interface, Quick Reference
环回接口,snull 是如何设计的
loopback interfaces, How snull Is Designed
循环、系统挂起忙等待短延迟
loops, System Hangs, Busy waiting, Short Delays
忙碌,忙碌等待
无休无止,系统挂起
软件,短延迟
busy, Busy waiting
endless, System Hangs
software, Short Delays
loops_per_jiffy 值,短延迟
loops_per_jiffy value, Short Delays
低内存、高内存和低内存
low memory, High and Low Memory
低级 sysfs 操作,低级 Sysfs 操作符号链接
low-level sysfs operations, Low-Level Sysfs OperationsSymbolic Links
ls 命令,识别设备类型、主要和次要编号
ls command, identifying device type, Major and Minor Numbers
LSR(线路状态寄存器)、ioctls
LSR (line status register), ioctls
ltalk_setup,接口信息
ltalk_setup, Interface Information
ltalk_setup函数,接口信息
ltalk_setup function, Interface Information
LTT (Linux Trace Toolkit),Linux 跟踪工具包
LTT (Linux Trace Toolkit), The Linux Trace Toolkit

M

M

M68k 架构(移植和)、平台依赖性
M68k architecture (porting and), Platform Dependencies
MAC(介质访问控制)地址、初始化每个设备接口信息设备方法MAC 地址解析非以太网标头
MAC (medium access control) addresses, Initializing Each Device, Interface Information, The Device Methods, MAC Address ResolutionNon-Ethernet Headers
解析,MAC 地址解析–非以太网标头
set_mac_address 方法,设备方法
resolution of, MAC Address ResolutionNon-Ethernet Headers
set_mac_address method and, The Device Methods
宏、Hello World 模块版本依赖性内核符号表设备编号的内部表示快速参考printkprintkprintkprintkprintkprintkprintkprintk完成简单睡眠手动睡眠快速参考,Tasklet,链接列表,链接列表,链表配置寄存器和初始化配置寄存器和初始化MODULE_DEVICE_TABLE驱动程序支持哪些设备?,驱动程序支持哪些设备?,驱动程序支持哪些设备?,驱动程序支持哪些设备?,总线属性,驱动程序结构嵌入,物理地址和页面,内存映射和结构页面,分散/聚集映射,分散/聚集映射,set_termios
macros, The Hello World Module, Version Dependency, The Kernel Symbol Table, The Internal Representation of Device Numbers, Quick Reference, printk, printk, printk, printk, printk, printk, printk, printk, Completions, Simple Sleeping, Manual sleeps, Quick Reference, Tasklets, Linked Lists, Linked Lists, Linked Lists, Configuration Registers and Initialization, Configuration Registers and Initialization, MODULE_DEVICE_TABLE, What Devices Does the Driver Support?, What Devices Does the Driver Support?, What Devices Does the Driver Support?, What Devices Does the Driver Support?, Bus attributes, Driver structure embedding, Physical Addresses and Pages, The Memory Map and Struct Page, Scatter/gather mappings, Scatter/gather mappings, set_termios
BUS_ATTR,总线属性
完成,完成
DECLARE_TASKLET,小任务
DRIVER_ATTR,驱动程序结构嵌入
Hello World 模块,Hello World 模块
INIT_LIST_HEAD,链接列表
设备编号的内部表示,设备编号的内部表示
ioctl 命令(创建),快速参考
KERN_ALERT,printk
KERN_CRIT,printk
KERN_DEBUG,printk
KERN_EMERG,printk
KERN_ERR,printk
KERN_INFO,printk
KERN_NOTICE,printk
KERN_WARNING,printk
list_entry,链接列表
list_for_each,链接列表
MINOR,快速参考
MODULE_DEVICE_TABLE,MODULE_DEVICE_TABLE
page_address、内存映射和结构页
PAGE_SHIFT,物理地址和页
PCI_DEVICE、配置寄存器和初始化
PCI_DEVICE_CLASS,配置寄存器和初始化
RELEVANT_IFLAG,set_termios
sg_dma_address,分散/聚集映射
sg_dma_len,分散/聚集映射
符号,内核符号表
USB_DEVICE_VER,驱动程序支持哪些设备?
USB_DEVICE,驱动程序支持哪些设备?
USB_DEVICE_INFO,驱动程序支持哪些设备?
USB_INTERFACE_INFO,驱动程序支持哪些设备?
版本依赖,版本依赖
等待队列,手动睡眠
等待事件,简单睡眠
BUS_ATTR, Bus attributes
completion, Completions
DECLARE_TASKLET, Tasklets
DRIVER_ATTR, Driver structure embedding
hello world module, The Hello World Module
INIT_LIST_HEAD, Linked Lists
internal representation of device numbers, The Internal Representation of Device Numbers
ioctl commands (creating), Quick Reference
KERN_ALERT, printk
KERN_CRIT, printk
KERN_DEBUG, printk
KERN_EMERG, printk
KERN_ERR, printk
KERN_INFO, printk
KERN_NOTICE, printk
KERN_WARNING, printk
list_entry, Linked Lists
list_for_each, Linked Lists
MINOR, Quick Reference
MODULE_DEVICE_TABLE, MODULE_DEVICE_TABLE
page_address, The Memory Map and Struct Page
PAGE_SHIFT, Physical Addresses and Pages
PCI_DEVICE, Configuration Registers and Initialization
PCI_DEVICE_CLASS, Configuration Registers and Initialization
RELEVANT_IFLAG, set_termios
sg_dma_address, Scatter/gather mappings
sg_dma_len, Scatter/gather mappings
symbols, The Kernel Symbol Table
USB_DEVICE_VER, What Devices Does the Driver Support?
USB_DEVICE, What Devices Does the Driver Support?
USB_DEVICE_INFO, What Devices Does the Driver Support?
USB_INTERFACE_INFO, What Devices Does the Driver Support?
version dependency, Version Dependency
wait queues, Manual sleeps
wait-event, Simple Sleeping
神奇的 SysRq 键,系统挂起
magic SysRq key, System Hangs
邮件列表、linux-kernel、加入内核开发社区
mailing list, linux-kernel, Joining the Kernel Development Community
主线内核、安装、设置测试系统
mainline kernels, installation of, Setting Up Your Test System
主要设备编号、主要编号和次要编号
major device numbers, Major and Minor Numbers
MAJOR 宏,快速参考
MAJOR macro, Quick Reference
主号码、主号码和次号码主号码动态分配主号码动态分配
major numbers, Major and Minor NumbersDynamic Allocation of Major Numbers, Dynamic Allocation of Major Numbers
字符驱动程序,主要号码和次要号码-主要号码的动态分配
动态分配,主要号码的动态分配
char drivers, Major and Minor NumbersDynamic Allocation of Major Numbers
dynamic allocation of, Dynamic Allocation of Major Numbers
make 命令,编译模块
make command, Compiling Modules
makefile、编译模块打开和关闭消息
makefiles, Compiling Modules, Turning the Messages On and Off
printk 函数,打开和关闭消息
printk function, Turning the Messages On and Off
管理、分割内核分割内核分割内核分割内核分割内核安全问题scull 的内存使用scull 的内存使用scull 中的陷阱并发及其管理并发及其管理锁定陷阱细粒度锁定与粗粒度锁定锁定的替代方案读取-复制-更新TaskletTasklet大小参数 –、I/O 端口和 I/O 内存isa_readb 和朋友快速参考Linux 设备模型管理类Linux 中的内存管理高内存和低内存内存映射和结构页内存映射和结构页页表虚拟内存区域vm_area_struct 结构进程内存映射mmap 设备操作重新映射内核虚拟地址执行直接 I/O异步 I/O 示例直接内存访问与 DMA 控制器对话DIY 分配直接内存访问
management, Splitting the Kernel, Splitting the Kernel, Splitting the Kernel, Splitting the Kernel, Splitting the Kernel, Security Issues, scull's Memory Usagescull's Memory Usage, Pitfalls in scull, Concurrency and Its ManagementConcurrency and Its Management, Locking TrapsFine- Versus Coarse-Grained Locking, Alternatives to LockingRead-Copy-Update, TaskletsTasklets, The Size Argument, I/O Ports and I/O Memoryisa_readb and Friends, Quick Reference, The Linux Device Model, Managing classes, Memory Management in LinuxHigh and Low Memory, The Memory Map and Struct PageThe Memory Map and Struct Page, Page Tables, Virtual Memory AreasThe vm_area_struct structure, The Process Memory Map, The mmap Device OperationRemapping Kernel Virtual Addresses, Performing Direct I/OAn asynchronous I/O example, Direct Memory AccessTalking to the DMA controller, Do-it-yourself allocation, Direct Memory Access
类,管理类
并发性、并发性及其管理并发性及其管理锁定陷阱细粒度锁定与粗粒度锁定锁定的替代方案读取-复制-更新
锁定的替代方案,锁定的替代方案读取-复制-更新
锁定陷阱,锁定陷阱细粒度锁定与粗粒度锁定
分片、DIY分配
硬件(I/O 端口和 I/O 内存)、I/O 端口和 I/O 内存isa_readb 和朋友
中断处理程序,快速参考
内存、分割内核scull 的内存使用scull 的内存使用scull 中的陷阱Linux 中的内存管理高内存和低内存内存映射和结构页内存映射和结构页页表虚拟内存区域vm_area_struct 结构进程内存映射mmap 设备操作重新映射内核虚拟地址执行直接 I/O异步 I/O 示例直接内存访问与 DMA 控制器对话直接内存访问
直接 I/O,执行直接 I/O异步 I/O 示例
DMA,直接内存访问与 DMA 控制器对话直接内存访问
映射、内存映射和结构页内存映射和结构页
mmap 设备操作,mmap 设备操作重新映射内核虚拟地址
页表,页表
进程内存映射,进程内存映射
scull, scull 的内存使用情况scull 的内存使用情况, scull 中的陷阱
VMA,虚拟内存区域vm_area_struct 结构
网络,分裂内核
物理内存,大小参数
电源,Linux 设备模型
进程,分裂内核,分裂内核
安全,安全问题
小任务,小任务小任务
classes, Managing classes
concurrency, Concurrency and Its ManagementConcurrency and Its Management, Locking TrapsFine- Versus Coarse-Grained Locking, Alternatives to LockingRead-Copy-Update
alternatives to locking, Alternatives to LockingRead-Copy-Update
locking traps, Locking TrapsFine- Versus Coarse-Grained Locking
fragmentation, Do-it-yourself allocation
hardware (I/O ports and I/O memory), I/O Ports and I/O Memoryisa_readb and Friends
interrupt handlers, Quick Reference
memory, Splitting the Kernel, scull's Memory Usagescull's Memory Usage, Pitfalls in scull, Memory Management in LinuxHigh and Low Memory, The Memory Map and Struct PageThe Memory Map and Struct Page, Page Tables, Virtual Memory AreasThe vm_area_struct structure, The Process Memory Map, The mmap Device OperationRemapping Kernel Virtual Addresses, Performing Direct I/OAn asynchronous I/O example, Direct Memory AccessTalking to the DMA controller, Direct Memory Access
direct I/O, Performing Direct I/OAn asynchronous I/O example
DMA, Direct Memory AccessTalking to the DMA controller, Direct Memory Access
mapping, The Memory Map and Struct PageThe Memory Map and Struct Page
mmap device operations, The mmap Device OperationRemapping Kernel Virtual Addresses
page tables, Page Tables
process memory maps, The Process Memory Map
scull, scull's Memory Usagescull's Memory Usage, Pitfalls in scull
VMAs, Virtual Memory AreasThe vm_area_struct structure
networks, Splitting the Kernel
physical memory, The Size Argument
power, The Linux Device Model
process, Splitting the Kernel, Splitting the Kernel
security, Security Issues
tasklets, TaskletsTasklets
手动休眠,手动休眠
manual sleeps, Manual sleeps
映射器程序,重新映射 RAM
mapper program, Remapping RAM
映射、I/O 内存分配和映射I/O 内存分配和映射快速参考内存映射和结构页内存映射和结构页进程内存映射mmap 设备操作重新映射内核虚拟地址mmap 设备操作DMA 映射DMA 映射设置流式 DMA 映射设置流式 DMA 映射单页流式映射分散/聚集映射分散/聚集映射分散/聚集映射PCI 双地址循环映射
mapping, I/O Memory Allocation and Mapping, I/O Memory Allocation and Mapping, Quick Reference, The Memory Map and Struct PageThe Memory Map and Struct Page, The Process Memory Map, The mmap Device OperationRemapping Kernel Virtual Addresses, The mmap Device Operation, DMA mappings, DMA mappings, Setting up streaming DMA mappings, Setting up streaming DMA mappings, Single-page streaming mappings, Scatter/gather mappings, Scatter/gather mappings, Scatter/gather mappings, PCI double-address cycle mappings
删除、设置流 DMA 映射
DMA、DMA 映射
I/O、I/O 内存分配和映射快速参考
内存、内存映射和结构页内存映射和结构页进程内存映射mmap 设备操作重新映射内核虚拟地址
mmap 设备操作,mmap 设备操作重新映射内核虚拟地址
进程内存映射,进程内存映射
PCI双地址周期、PCI双地址周期映射
寄存器、DMA 映射分散/聚集映射
分散/聚集 DMA、分散/聚集映射
分散列表和分散/聚集映射
单页流式传输、单页流式传输映射
软件映射内存、I/O 内存分配和映射
流 DMA 配置,设置流 DMA 映射
显存、mmap设备操作
deleting, Setting up streaming DMA mappings
DMA, DMA mappings
I/O, I/O Memory Allocation and Mapping, Quick Reference
memory, The Memory Map and Struct PageThe Memory Map and Struct Page, The Process Memory Map, The mmap Device OperationRemapping Kernel Virtual Addresses
mmap device operations, The mmap Device OperationRemapping Kernel Virtual Addresses
process memory maps, The Process Memory Map
PCI double-address cycle, PCI double-address cycle mappings
registers, DMA mappings, Scatter/gather mappings
scatter-gather DMA, Scatter/gather mappings
scatterlists and, Scatter/gather mappings
single-page streaming, Single-page streaming mappings
software-mapped memory, I/O Memory Allocation and Mapping
streaming DMA configuration, Setting up streaming DMA mappings
video memory, The mmap Device Operation
匹配函数(总线)、总线方法
match function (buses), Bus methods
MCA(微通道架构)、MCA
MCA (Micro Channel Architecture), MCA
mdelay,短延迟
mdelay, Short Delays
时间流逝的测量,测量时间流逝——处理器特定的寄存器
measurement of time lapses, Measuring Time LapsesProcessor-Specific Registers
媒体独立接口 (MII)、媒体独立接口支持
Media Independent Interface (MII), Media Independent Interface Support
media_changed 方法,支持可移动媒体
media_changed method, Supporting Removable Media
内存、分割内核加载和卸载模块scull 的设计scull 的设计scull 的设计scull 的内存使用scull 的内存使用、 scull 中的陷阱信号量和互斥体、使用ioctl 参数真实的故事kmalloc大小参数标志参数内存区域、Lookaside缓存alloc_pages 接口Lookaside 缓存、、内存池 get_free_page 和朋友, get_free_page 和朋友, vmalloc 和朋友使用虚拟地址的 scull : scullv ,每 CPU 变量每 CPU 变量,获取大缓冲区,快速参考,快速参考,快速参考,快速参考, I/O端口和 I/O 内存isa_readb 和朋友I/O 寄存器和常规内存I/O 寄存器和常规内存I/O 寄存器和常规内存I/O 内存分配和映射, I/O 内存分配和映射, 1 MB 以下 ISA 内存,快速参考,快速参考,快速参考,页大小, PCI 寻址,访问 I/O 和内存空间,内存映射和 DMA , Linux 中的内存管理高内存和低内存,高内存和低内存,高内存和低内存,高内存和低内存,内存映射和结构页内存映射和结构页,页表虚拟内存区域vm_area_struct 结构进程内存映射mmap 设备操作重新映射内核虚拟地址重新映射 RAM执行直接 I/O异步 I/O 示例执行直接 I/O直接内存访问与 DMA 控制器对话DIY 分配直接内存访问sbull 中的初始化硬件信息
memory, Splitting the Kernel, Loading and Unloading Modules, The Design of scull, The Design of scull, The Design of scull, scull's Memory Usagescull's Memory Usage, Pitfalls in scull, Semaphores and Mutexes, Using the ioctl Argument, The Real Story of kmallocThe Size Argument, The Flags Argument, Memory zones, Lookaside CachesThe alloc_pages Interface, Lookaside Caches, Memory Pools, get_free_page and Friends, get_free_page and Friends, vmalloc and FriendsA scull Using Virtual Addresses: scullv, Per-CPU VariablesPer-CPU Variables, Obtaining Large Buffers, Quick Reference, Quick Reference, Quick Reference, Quick Reference, I/O Ports and I/O Memoryisa_readb and Friends, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory, I/O Memory Allocation and Mapping, I/O Memory Allocation and Mapping, ISA Memory Below 1 MB, Quick Reference, Quick Reference, Quick Reference, Page Size, PCI Addressing, Accessing the I/O and Memory Spaces, Memory Mapping and DMA, Memory Management in LinuxHigh and Low Memory, High and Low Memory, High and Low Memory, High and Low Memory, The Memory Map and Struct PageThe Memory Map and Struct Page, Page Tables, Virtual Memory AreasThe vm_area_struct structure, The Process Memory Map, The mmap Device OperationRemapping Kernel Virtual Addresses, Remapping RAM, Performing Direct I/OAn asynchronous I/O example, Performing Direct I/O, Direct Memory AccessTalking to the DMA controller, Do-it-yourself allocation, Direct Memory Access, Initialization in sbull, Hardware Information
分配,kmalloc 的真实故事大小参数标志参数Lookaside 缓存alloc_pages 接口Lookaside 缓存get_free_page 和朋友get_free_page 和朋友vmalloc 和朋友使用虚拟地址的 scull: scullv每 CPU 变量每 CPU 变量获取大缓冲区快速参考快速参考快速参考I/O 内存分配和映射快速参考
启动时间、获取大缓冲区快速参考
flags、标志参数后备缓存快速参考
I/O、I/O 内存分配和映射快速参考
kmalloc 分配引擎,kmalloc 的真实故事大小参数
Lookaside 缓存,Lookaside 缓存alloc_pages 接口快速参考
按页面、get_free_page 和好友
每 CPU 变量、每 CPU 变量每 CPU 变量
性能下降问题、get_free_page 和朋友
vmalloc 分配函数、vmalloc 和朋友使用虚拟地址的 scull:scullv
屏障、I/O 寄存器和传统存储器I/O 寄存器和传统存储器快速参考
块驱动程序,在 sbull 中初始化
DMA、内存映射和 DMA(请参阅 DMA)
全局区域,scull 的设计
硬件、硬件信息
高、高、低内存
I/O、I/O 端口和 I/O 内存isa_readb 和朋友快速参考
ISA, 1 MB 以下的 ISA 内存
访问,1 MB 以下的 ISA 内存
限制,高内存和低内存
锁定,信号量和互斥体
低的,高内存和低内存
管理、分割内核Linux 中的内存管理高内存和低内存内存映射和结构页内存映射和结构页页表虚拟内存区域vm_area_struct 结构进程内存映射mmap 设备操作重新映射内核虚拟地址执行直接 I/O异步 I/O 示例直接内存访问与 DMA 控制器对话自行分配直接内存访问
直接 I/O,执行直接 I/O异步 I/O 示例
DMA,直接内存访问与 DMA 控制器对话直接内存访问
碎片化,DIY分配
映射、内存映射和结构页内存映射和结构页
mmap 设备操作,mmap 设备操作重新映射内核虚拟地址
页表,页表
处理内存映射,进程内存映射
VMA,虚拟内存区域vm_area_struct 结构
模块(加载),加载和卸载模块
页面大小和可移植性,页面大小
PCI、PCI 寻址访问 I/O 和内存空间
持久性,scull 的设计
池、内存池快速参考
重新映射 RAM,重新映射 RAM
scull, scull 的设计, scull 的内存使用scull 的内存使用, scull 中的陷阱
设计,scull 的设计
故障排除,scull 中的陷阱
scull 的内存使用情况– scull的内存使用情况
软件映射(和 ioremap 函数), I/O 内存分配和映射
用户空间,执行直接 I/O
验证用户空间地址,使用 ioctl 参数
与 I/O 寄存器、I/O 寄存器和传统存储器的比较
区域、内存区域
allocation, The Real Story of kmallocThe Size Argument, The Flags Argument, Lookaside CachesThe alloc_pages Interface, Lookaside Caches, get_free_page and Friends, get_free_page and Friends, vmalloc and FriendsA scull Using Virtual Addresses: scullv, Per-CPU VariablesPer-CPU Variables, Obtaining Large Buffers, Quick Reference, Quick Reference, Quick Reference, I/O Memory Allocation and Mapping, Quick Reference
boot time, Obtaining Large Buffers, Quick Reference
flags, The Flags Argument, Lookaside Caches, Quick Reference
I/O, I/O Memory Allocation and Mapping, Quick Reference
kmalloc allocation engine, The Real Story of kmallocThe Size Argument
lookaside caches, Lookaside CachesThe alloc_pages Interface, Quick Reference
by page, get_free_page and Friends
per-CPU variables, Per-CPU VariablesPer-CPU Variables
performance degradation issues, get_free_page and Friends
vmalloc allocation function, vmalloc and FriendsA scull Using Virtual Addresses: scullv
barriers, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory, Quick Reference
block drivers, Initialization in sbull
DMA, Memory Mapping and DMA (see DMA)
global areas, The Design of scull
hardware, Hardware Information
high, High and Low Memory
I/O, I/O Ports and I/O Memoryisa_readb and Friends, Quick Reference
ISA, ISA Memory Below 1 MB
access, ISA Memory Below 1 MB
limitations on, High and Low Memory
locking, Semaphores and Mutexes
low, High and Low Memory
management, Splitting the Kernel, Memory Management in LinuxHigh and Low Memory, The Memory Map and Struct PageThe Memory Map and Struct Page, Page Tables, Virtual Memory AreasThe vm_area_struct structure, The Process Memory Map, The mmap Device OperationRemapping Kernel Virtual Addresses, Performing Direct I/OAn asynchronous I/O example, Direct Memory AccessTalking to the DMA controller, Do-it-yourself allocation, Direct Memory Access
direct I/O, Performing Direct I/OAn asynchronous I/O example
DMA, Direct Memory AccessTalking to the DMA controller, Direct Memory Access
fragmentation, Do-it-yourself allocation
mapping, The Memory Map and Struct PageThe Memory Map and Struct Page
mmap device operations, The mmap Device OperationRemapping Kernel Virtual Addresses
page tables, Page Tables
process memory maps, The Process Memory Map
VMAs, Virtual Memory AreasThe vm_area_struct structure
modules (loading), Loading and Unloading Modules
page size and portability, Page Size
PCI, PCI Addressing, Accessing the I/O and Memory Spaces
persistence, The Design of scull
pools, Memory Pools, Quick Reference
remapping RAM, Remapping RAM
scull, The Design of scull, scull's Memory Usagescull's Memory Usage, Pitfalls in scull
design of, The Design of scull
troubleshooting, Pitfalls in scull
usage, scull's Memory Usagescull's Memory Usage
software-mapped (and ioremap function), I/O Memory Allocation and Mapping
user space, Performing Direct I/O
verifying user-space addresses, Using the ioctl Argument
versus I/O registers, I/O Registers and Conventional Memory
zones, Memory zones
内存管理,进程内存映射进程内存映射
memory management, The Process Memory Map, The Process Memory Map
理论,进程内存映射
VMA,进程内存映射
theory of, The Process Memory Map
VMAs, The Process Memory Map
消息、Hello World 模块Hello World 模块printk重定向控制台消息如何记录消息打开和关闭消息打开和关闭消息速率限制Oops 消息Oops 消息
messages, The Hello World Module, The Hello World Module, printk, Redirecting Console Messages, How Messages Get Logged, Turning the Messages On and Off, Turning the Messages On and Off, Rate Limiting, Oops MessagesOops Messages
控制台,重定向控制台消息
调试、打开和关闭消息速率限制
禁用、打开和关闭消息
(printk 函数)的限制,速率限制
全局启用/禁用,打开和关闭消息
内核,Hello World 模块
日志记录,如何记录消息
oops,Oops 消息Oops 消息
优先级(日志级别),Hello World 模块,printk
consoles, Redirecting Console Messages
debug, Turning the Messages On and Off, Rate Limiting
disabling, Turning the Messages On and Off
limitation of (printk function), Rate Limiting
globally enabling/disabling, Turning the Messages On and Off
kernels, The Hello World Module
logging, How Messages Get Logged
oops, Oops MessagesOops Messages
priorities (loglevels) of, The Hello World Module, printk
方法、文件操作文件操作文件操作文件操作文件操作文件操作文件操作文件操作、文件操作、文件操作、文件操作、文件操作文件操作文件操作文件操作文件操作文件操作,文件操作,文件操作,文件操作文件操作文件操作文件操作文件结构文件结构文件结构open 方法- open 方法release 方法release 方法release 方法release 方法读取和写读和写读和写读和写读和写读方法读方法, write 方法, write 方法, readv 和 writev , readv 和 writev ,在 /proc 中实现文件, seq_file 接口, seq_file 接口, seq_file 接口, seq_file 接口, ioctl 方法,通过观察进行调试,通过观察进行调试, Oops 消息, Oops 消息,系统挂起,自旋锁函数,原子变量,原子变量,原子变量,原子变量,原子变量,原子变量,原子变量,原子变量,原子变量,操作,位操作,位操作,位操作 , 位操作,位操作,位操作,位操作, ioctl设备控制无ioctlioctl阻塞和非阻塞操作轮询和选择底层数据结构轮询和选择底层数据结构从设备读取数据写入设备刷新挂起输出刷新挂起输出寻找设备单开设备限制对单个用户的访问a Time阻止打开作为 EBUSY 的替代方案阻止打开作为 EBUSY 的替代方案在打开时克隆设备字符串操作快速参考释放函数和 kobject 类型默认属性默认属性热插拔操作热插拔操作、总线方法vm_area_struct 结构vm_area_struct 结构vm_area_struct 结构vm_area_struct 结构mmap 设备操作重新映射内核虚拟地址映射内存nopage使用 nopage 方法重新映射 RAM异步 I/O注册 DMA 使用与 DMA 控制器对话块设备操作打开和释放方法打开和释放方法支持可移动媒体支持可移动媒体ioctl 方法ioctl 方法接口信息设备方法设备方法设备方法设备方法设备方法设备方法设备方法设备方法设备方法,设备方法,设备方法,设备方法,设备方法,设备方法,设备方法,设备方法,设备方法,数据包传输,数据包传输,在以太网中使用 ARP ,自定义 ioctl 命令,自定义 ioctl 命令,统计信息,多播的内核支持, Netpoll
methods, File OperationsFile Operations, File Operations, File Operations, File Operations, File Operations, File Operations, File Operations, File Operations, File Operations, File Operations, File Operations, File Operations, File Operations, File Operations, File Operations, File Operations, File Operations, File Operations, File Operations, File Operations, File Operations, File Operations, The file Structure, The file Structure, The file Structure, The open MethodThe open Method, The release Method, The release Method, The release Method, The release Method, read and write, read and write, read and write, read and write, read and write, The read Method, The read Method, The write Method, The write Method, readv and writev, readv and writev, Implementing files in /proc, The seq_file interface, The seq_file interface, The seq_file interface, The seq_file interface, The ioctl Method, Debugging by Watching, Debugging by Watching, Oops Messages, Oops Messages, System Hangs, The Spinlock Functions, Atomic Variables, Atomic Variables, Atomic Variables, Atomic Variables, Atomic Variables, Atomic Variables, Atomic Variables, Atomic Variables, Atomic Variables, Bit Operations, Bit Operations, Bit Operations, Bit Operations, Bit Operations, Bit Operations, Bit Operations, Bit Operations, ioctlDevice Control Without ioctl, ioctl, Blocking and Nonblocking Operations, poll and selectThe Underlying Data Structure, poll and selectThe Underlying Data Structure, Reading data from the device, Writing to the device, Flushing pending output, Flushing pending output, Seeking a Device, Single-Open Devices, Restricting Access to a Single User at a Time, Blocking open as an Alternative to EBUSY, Blocking open as an Alternative to EBUSY, Cloning the Device on open, String Operations, Quick Reference, Release functions and kobject types, Default Attributes, Default Attributes, Hotplug Operations, Hotplug Operations, Bus methods, The vm_area_struct structure, The vm_area_struct 
structure, The vm_area_struct structure, The vm_area_struct structure, The mmap Device OperationRemapping Kernel Virtual Addresses, Mapping Memory with nopage, Remapping RAM with the nopage method, Asynchronous I/O, Registering DMA usage, Talking to the DMA controller, Block device operations, The open and release Methods, The open and release Methods, Supporting Removable Media, Supporting Removable Media, The ioctl Method, The ioctl Method, Interface Information, The Device Methods, The Device Methods, The Device Methods, The Device Methods, The Device Methods, The Device Methods, The Device Methods, The Device Methods, The Device Methods, The Device Methods, The Device Methods, The Device Methods, The Device Methods, The Device Methods, The Device Methods, The Device Methods, The Device Methods, Packet Transmission, Packet Transmission, Using ARP with Ethernet, Custom ioctl Commands, Custom ioctl Commands, Statistical Information, Kernel Support for Multicasting, Netpoll
block_fsync,刷新待处理输出
总线,总线方法
change_mtu,设备方法
check_flags,文件操作
close、release方法vm_area_struct结构体
设备,设备方法
*dir_notify,文件操作
do_ioctl、设备方法自定义 ioctl 命令
fasync,文件操作
flush、文件操作release方法
fsync、文件操作刷新挂起输出
get_stats,设备方法统计信息
hard_header,设备方法在以太网中使用 ARP
hard_start_transmit,数据包传输
hard_start_xmit,设备方法数据包传输
header_cache,设备方法
header_cache_update,设备方法
ioctl、文件操作ioctl 方法ioctl无需 ioctl 的设备控制、 ioctlioctl 方法自定义 ioctl 命令
块驱动程序,ioctl 方法
自定义网络、自定义 ioctl 命令
调试,ioctl 方法
inode 指针,ioctl
llseek、文件操作寻找设备
lock,文件操作
media_changed,支持可移动媒体
mmap,文件操作
next,seq_file 接口
nopage、vm_area_struct 结构使用 nopage 映射内存使用 nopage 方法重新映射 RAM
open、文件操作文件结构open 方法open 方法单次打开设备限制一次对单个用户的访问阻止 open 作为 EBUSY 的替代方案vm_area_struct 结构注册 DMA 使用open和释放方法设备方法
块驱动程序,打开和释放方法
阻塞,阻塞打开作为 EBUSY 的替代方案
对于网络设备,设备方法
private_data 和文件结构
请求 DMA 通道、注册 DMA 使用
限制并发用户以及限制一次单个用户的访问
用于单开设备、单开设备
vm_operations_struct 结构、vm_area_struct 结构
操作、文件操作文件操作文件操作readv 和 writev系统挂起自旋锁函数原子变量、原子变量原子变量原子变量原子变量、原子变量、原子变量原子变量原子变量位运算,位运算,位运算,位运算,位操作,位操作,位操作 , 位操作,阻塞和非阻塞操作,字符串操作,快速参考,热插拔操作,热插拔操作, mmap 设备操作重新映射内核虚拟地址,异步 I/O ,块设备操作,设备方法
aio_fsync,异步I/O
atomic_add,原子变量
atomic_dec,原子变量
atomic_dec_and_test,原子变量
atomic_inc,原子变量
atomic_inc_and_test,原子变量
atomic_read,原子变量
atomic_set,原子变量
atomic_sub,原子变量
atomic_sub_and_test,原子变量
位、位运算
块驱动程序、块设备操作
阻塞/非阻塞、阻塞和非阻塞操作
change_bit,位操作
clear_bit,位操作
设备,设备方法
文件,文件操作文件操作
过滤热插拔、热插拔操作
刷新,文件操作
热插拔、热插拔操作
mmap devices,mmap 设备操作重新映射内核虚拟地址
set_bit,位操作
自旋锁,自旋锁函数
字符串、字符串操作快速参考
sysrq,系统挂起
test_and_change_bit,位操作
test_and_clear_bit,位操作
test_and_set_bit,位操作
test_bit,位操作
矢量、readv 和 writev
poll、文件操作poll 和 select底层数据结构设备方法
poll_controller,网络轮询
populate,vm_area_struct 结构
pread,读和写
proc_read,在/proc中实现文件
pwrite,读和写
读取、文件操作文件结构读取和写入读取和写入读取方法读取方法通过观察进行调试Oops 消息从设备读取数据与 DMA 控制器对话
参数、读取和写入
代码,read 方法
配置 DMA 控制器、与 DMA 控制器对话
f_pos 字段(文件结构)和,文件结构
oops 消息,Oops 消息
poll方法以及,从设备读取数据
解释返回值的规则,read 方法
strace 命令和,通过观察进行调试
readdir,文件操作
readv,文件操作
rebuild_header,设备方法
release、文件操作release 方法release 方法阻塞 open 作为 EBUSY 的替代方案在 open 上克隆设备release 函数和 kobject 类型open 和 release 方法
块驱动程序的打开和释放方法
阻塞,阻塞打开作为 EBUSY 的替代方案
克隆设备,在打开时克隆设备
kobject、Release 函数和 kobject 类型
revalidate,支持可移动媒体
sbull ioctl,ioctl 方法
select,poll 和 select底层数据结构
select,poll 方法和,文件操作
set_config,设备方法
set_mac_address,设备方法
set_multicast_list、接口信息设备方法多播的内核支持
show,seq_file 接口,默认属性
kobjects,默认属性
seq_file 接口, seq_file 接口
start,seq_file 接口
stop,设备方法
store(kobjects),默认属性
strace 命令和,通过观察进行调试
struct module *owner,文件操作
tx_timeout,设备方法
unsigned long,文件操作
write、文件操作文件结构读和写write 方法write 方法Oops 消息写入设备
代码,write 方法
f_pos 字段(文件结构)和,文件结构
返回值的解释规则,write 方法
oops 消息,Oops 消息
poll 方法以及,写入设备
writev、文件操作readv 和 writev
block_fsync, Flushing pending output
buses, Bus methods
change_mtu, The Device Methods
check_flags, File Operations
close, The release Method, The vm_area_struct structure
devices, The Device Methods
*dir_notify, File Operations
do_ioctl, The Device Methods, Custom ioctl Commands
fasync, File Operations
flush, File Operations, The release Method
fsync, File Operations, Flushing pending output
get_stats, The Device Methods, Statistical Information
hard_header, The Device Methods, Using ARP with Ethernet
hard_start_transmit, Packet Transmission
hard_start_xmit, The Device Methods, Packet Transmission
header_cache, The Device Methods
header_cache_update, The Device Methods
ioctl, File Operations, The ioctl Method, ioctlDevice Control Without ioctl, ioctl, The ioctl Method, Custom ioctl Commands
block drivers, The ioctl Method
customizing for networking, Custom ioctl Commands
debugging with, The ioctl Method
inode pointer in, ioctl
llseek, File Operations, Seeking a Device
lock, File Operations
media_changed, Supporting Removable Media
mmap, File Operations
next, The seq_file interface
nopage, The vm_area_struct structure, Mapping Memory with nopage, Remapping RAM with the nopage method
open, File Operations, The file Structure, The open MethodThe open Method, Single-Open Devices, Restricting Access to a Single User at a Time, Blocking open as an Alternative to EBUSY, The vm_area_struct structure, Registering DMA usage, The open and release Methods, The Device Methods
block drivers, The open and release Methods
blocking, Blocking open as an Alternative to EBUSY
for network devices, The Device Methods
private_data and, The file Structure
requesting DMA channels, Registering DMA usage
restricting simultaneous users and, Restricting Access to a Single User at a Time
for single-open devices, Single-Open Devices
vm_operations_struct structure, The vm_area_struct structure
operations, File OperationsFile Operations, File Operations, readv and writev, System Hangs, The Spinlock Functions, Atomic Variables, Atomic Variables, Atomic Variables, Atomic Variables, Atomic Variables, Atomic Variables, Atomic Variables, Atomic Variables, Atomic Variables, Bit Operations, Bit Operations, Bit Operations, Bit Operations, Bit Operations, Bit Operations, Bit Operations, Bit Operations, Blocking and Nonblocking Operations, String Operations, Quick Reference, Hotplug Operations, Hotplug Operations, The mmap Device OperationRemapping Kernel Virtual Addresses, Asynchronous I/O, Block device operations, The Device Methods
aio_fsync, Asynchronous I/O
atomic_add, Atomic Variables
atomic_dec, Atomic Variables
atomic_dec_and_test, Atomic Variables
atomic_inc, Atomic Variables
atomic_inc_and_test, Atomic Variables
atomic_read, Atomic Variables
atomic_set, Atomic Variables
atomic_sub, Atomic Variables
atomic_sub_and_test, Atomic Variables
bit, Bit Operations
block drivers, Block device operations
blocking/nonblocking, Blocking and Nonblocking Operations
change_bit, Bit Operations
clear_bit, Bit Operations
devices, The Device Methods
files, File OperationsFile Operations
filter hotplug, Hotplug Operations
flush, File Operations
hotplugs, Hotplug Operations
mmap devices, The mmap Device OperationRemapping Kernel Virtual Addresses
set_bit, Bit Operations
spinlocks, The Spinlock Functions
string, String Operations, Quick Reference
sysrq, System Hangs
test_and_change_bit, Bit Operations
test_and_clear_bit, Bit Operations
test_and_set_bit, Bit Operations
test_bit, Bit Operations
vector, readv and writev
poll, File Operations, poll and selectThe Underlying Data Structure, The Device Methods
poll_controller, Netpoll
populate, The vm_area_struct structure
pread, read and write
proc_read, Implementing files in /proc
pwrite, read and write
read, File Operations, The file Structure, read and write, read and write, The read Method, The read Method, Debugging by Watching, Oops Messages, Reading data from the device, Talking to the DMA controller
arguments to, read and write
code for, The read Method
configuring DMA controllers, Talking to the DMA controller
f_pos field (file structure) and, The file Structure
oops messages, Oops Messages
poll method and, Reading data from the device
rules for interpreting return values, The read Method
strace command and, Debugging by Watching
readdir, File Operations
readv, File Operations
rebuild_header, The Device Methods
release, File Operations, The release Method, The release Method, Blocking open as an Alternative to EBUSY, Cloning the Device on open, Release functions and kobject types, The open and release Methods
block drivers, The open and release Methods
blocking, Blocking open as an Alternative to EBUSY
cloning devices, Cloning the Device on open
kobjects, Release functions and kobject types
revalidate, Supporting Removable Media
sbull ioctl, The ioctl Method
select, poll and selectThe Underlying Data Structure
select, poll method and, File Operations
set_config, The Device Methods
set_mac_address, The Device Methods
set_multicast_list, Interface Information, The Device Methods, Kernel Support for Multicasting
show, The seq_file interface, Default Attributes
kobjects, Default Attributes
seq_file interface, The seq_file interface
start, The seq_file interface
stop, The Device Methods
store (kobjects), Default Attributes
strace command and, Debugging by Watching
struct module *owner, File Operations
tx_timeout, The Device Methods
unsigned long, File Operations
write, File Operations, The file Structure, read and write, The write Method, The write Method, Oops Messages, Writing to the device
code for, The write Method
f_pos field (file structure) and, The file Structure
interpreting rules for return values, The write Method
oops messages, Oops Messages
poll method and, Writing to the device
writev, File Operations, readv and writev
鼠标、异步通知输入
mice, Asynchronous Notification, Input
异步通知,异步通知
热插拔、输入
asynchronous notification, Asynchronous Notification
hotplugging, Input
微通道架构(MCA),MCA
Micro Channel Architecture (MCA), MCA
微秒分辨率,了解当前时间
microsecond resolution, Knowing the Current Time
MII(媒体独立接口),媒体独立接口支持
MII (Media Independent Interface), Media Independent Interface Support
次设备号、主设备号和次设备号
minor device numbers, Major and Minor Numbers
MINOR 宏,快速参考,快速参考
MINOR macro, Quick Reference, Quick Reference
次要编号、字符驱动程序、主要编号和次要编号主编号的动态分配
minor numbers, char drivers, Major and Minor NumbersDynamic Allocation of Major Numbers
MIPS 处理器、处理器特定寄存器平台依赖性
MIPS processor, Processor-Specific Registers, Platform Dependencies
内联汇编代码和处理器特定寄存器
移植和平台依赖性
inline assembly code and, Processor-Specific Registers
porting and, Platform Dependencies
misc-progs 目录,重定向控制台消息测试 Scullpipe 驱动程序
misc-progs directory, Redirecting Console Messages, Testing the Scullpipe Driver
中断缓解,接收中断缓解
mitigation of interrupts, Receive Interrupt Mitigation
MKDEV 宏,快速参考
MKDEV macro, Quick Reference
mlock 系统调用,在用户空间中执行
mlock system call, Doing It in User Space
mmap、Linux 中的内存管理高内存和低内存vm_area_struct 结构mmap 设备操作重新映射内核虚拟地址实现 mmap
mmap, Memory Management in LinuxHigh and Low Memory, The vm_area_struct structure, The mmap Device OperationRemapping Kernel Virtual Addresses, Implementing mmap
(另请参阅内存管理)
设备操作,mmap 设备操作重新映射内核虚拟地址
实现,Linux 中的内存管理高内存和低内存实现 mmap
(see also memory management)
device operations, The mmap Device OperationRemapping Kernel Virtual Addresses
implementation, Memory Management in LinuxHigh and Low Memory, Implementing mmap
mmap方法、文件操作vm_area_struct结构添加VMA操作
mmap method, File Operations, The vm_area_struct structure, Adding VMA Operations
使用计数以及,添加 VMA 操作
vm_area_struct结构和vm_area_struct结构
usage count and, Adding VMA Operations
vm_area_struct structure and, The vm_area_struct structure
模式(级别)、CPU、用户空间和内核空间
modalities (levels), CPU, User Space and Kernel Space
模型(Linux 设备)、Linux 设备模型Linux 设备模型Kobject、Kset 和子系统子系统低级 Sysfs 操作符号链接热插拔事件生成总线总线属性设备驱动程序结构嵌入类接口将它们放在一起删除驱动程序热插拔udev处理固件怎么运行的
models (Linux device), The Linux Device ModelThe Linux Device Model, Kobjects, Ksets, and SubsystemsSubsystems, Low-Level Sysfs OperationsSymbolic Links, Hotplug Event Generation, BusesBus attributes, DevicesDriver structure embedding, ClassesClass interfaces, Putting It All TogetherRemove a Driver, Hotplugudev, Dealing with FirmwareHow It Works
总线、总线总线属性
类,类类接口
设备,设备驱动程序结构嵌入
固件,处理固件它是如何工作的
热插拔、热插拔事件生成热插拔udev
kobjects、Kobjects、Ksets 和子系统子系统
生命周期,将它们放在一起删除驱动程序
低级 sysfs 操作,低级 Sysfs 操作符号链接
buses, BusesBus attributes
classes, ClassesClass interfaces
devices, DevicesDriver structure embedding
firmware, Dealing with FirmwareHow It Works
hotplugging, Hotplug Event Generation, Hotplugudev
kobjects, Kobjects, Ksets, and SubsystemsSubsystems
lifecycles, Putting It All TogetherRemove a Driver
low-level sysfs operations, Low-Level Sysfs OperationsSymbolic Links
模式、主编号动态分配文件结构内核定时器TaskletTasklet
modes, Dynamic Allocation of Major Numbers, The file Structure, Kernel Timers, TaskletsTasklets
设备模式、主号码动态分配
文件模式、文件结构
中断、内核定时器TaskletTasklet
异步执行,内核定时器
tasklet,TaskletTasklet
device modes, Dynamic Allocation of Major Numbers
file modes, The file Structure
interrupt, Kernel Timers, TaskletsTasklets
asynchronous execution, Kernel Timers
tasklets, TaskletsTasklets
mode_t f_mode (struct file field),文件结构
mode_t f_mode (struct file field), The file Structure
mode_t 模式变量(USB)、探测和断开详细信息
mode_t mode variable (USB), probe and disconnect in Detail
modprobe 实用程序、加载和卸载模块内核符号表内核符号表模块参数
modprobe utility, Loading and Unloading Modules, The Kernel Symbol Table, The Kernel Symbol Table, Module Parameters
分配参数值,模块参数
insmod 程序与内核符号表
assigning parameter values, Module Parameters
insmod program versus, The Kernel Symbol Table
模块化、分层、内核符号表
modularization, layered, The Kernel Symbol Table
module.h 头文件,快速参考
module.h header file, Quick Reference
模块、可加载模块可加载模块设备和模块类设备和模块类安全问题许可条款设置测试系统Hello World 模块Hello World 模块内核模块与应用程序一些其他详细信息,内核模块与应用程序,内核模块与应用程序,内核模块与应用程序,当前进程,编译模块编译模块加载和卸载模块加载和卸载模块-平台依赖性加载和卸载模块版本依赖性平台依赖性平台依赖性内核符号表-内核符号表内核符号表内核符号表准备工作初始化和关闭模块加载竞赛模块加载竞赛模块参数模块参数在用户空间中执行操作在用户空间中执行操作快速参考主编号动态分配主编号动态分配、主编号动态分配、Oops 消息、完成示例驱动程序内核辅助探测使用标准C类型模块卸载
modules, Loadable Modules, Loadable Modules, Classes of Devices and Modules, Classes of Devices and Modules, Security Issues, License Terms, Setting Up Your Test System, The Hello World ModuleThe Hello World Module, Kernel Modules Versus ApplicationsA Few Other Details, Kernel Modules Versus Applications, Kernel Modules Versus Applications, Kernel Modules Versus Applications, The Current Process, Compiling ModulesCompiling Modules, Loading and Unloading Modules, Loading and Unloading ModulesPlatform Dependency, Loading and Unloading Modules, Version Dependency, Platform Dependency, Platform Dependency, The Kernel Symbol TableThe Kernel Symbol Table, The Kernel Symbol Table, The Kernel Symbol Table, Preliminaries, Initialization and ShutdownModule-Loading Races, Module-Loading Races, Module ParametersModule Parameters, Doing It in User SpaceDoing It in User Space, Quick Reference, Dynamic Allocation of Major Numbers, Dynamic Allocation of Major Numbers, Dynamic Allocation of Major Numbers, Oops Messages, Completions, A Sample Driver, Kernel-assisted probing, Use of Standard C Types, Module Unloading
应用程序,内核模块与应用程序一些其他详细信息
授权、安全问题
base 模块参数,示例驱动程序
代码要求,预备知识
编译,编译模块编译模块
complete,完成
当前进程和,当前进程
动态模块分配,主编号动态分配
动态号码分配,主要号码动态分配
故障(oops 消息),Oops 消息
文件,快速参考
文件系统、设备类和模块
头文件,内核模块与应用程序
hello world,Hello World 模块Hello World 模块
初始化,初始化和关闭模块加载竞赛
kdatasize,标准 C 类型的使用
许可条款,许可条款
加载、内核模块与应用程序加载和卸载模块加载和卸载模块-平台依赖性模块加载竞赛主要编号的动态分配
insmod 程序以及加载和卸载模块
竞赛、模块加载竞赛
使用初始化脚本,动态分配主号码
参数,模块参数模块参数
平台依赖性、平台依赖性平台依赖性
SCSI,设备和模块类别
short,内核辅助探测
堆栈,内核符号表内核符号表
符号,内核符号表内核符号表
测试系统设置,设置您的测试系统
卸载、内核模块与应用程序加载和卸载模块模块卸载
用户空间编程,在用户空间中进行在用户空间中进行
版本依赖,版本依赖
applications, Kernel Modules Versus ApplicationsA Few Other Details
authorization, Security Issues
base module parameter, A Sample Driver
code requirements, Preliminaries
compiling, Compiling ModulesCompiling Modules
complete, Completions
current process and, The Current Process
dynamic module assignment, Dynamic Allocation of Major Numbers
dynamic number assignment, Dynamic Allocation of Major Numbers
faulty (oops messages), Oops Messages
files, Quick Reference
filesystem, Classes of Devices and Modules
header files of, Kernel Modules Versus Applications
hello world, The Hello World ModuleThe Hello World Module
initialization, Initialization and ShutdownModule-Loading Races
kdatasize, Use of Standard C Types
license terms, License Terms
loading, Kernel Modules Versus Applications, Loading and Unloading Modules, Loading and Unloading ModulesPlatform Dependency, Module-Loading Races, Dynamic Allocation of Major Numbers
insmod program and, Loading and Unloading Modules
races, Module-Loading Races
using init scripts, Dynamic Allocation of Major Numbers
parameters, Module ParametersModule Parameters
platform dependency, Platform Dependency, Platform Dependency
SCSI, Classes of Devices and Modules
short, Kernel-assisted probing
stacking, The Kernel Symbol Table, The Kernel Symbol Table
symbols, The Kernel Symbol TableThe Kernel Symbol Table
test system setup, Setting Up Your Test System
unloading, Kernel Modules Versus Applications, Loading and Unloading Modules, Module Unloading
user-space programming, Doing It in User SpaceDoing It in User Space
version dependency, Version Dependency
MODULE_ALIAS 宏,快速参考
MODULE_ALIAS macro, Quick Reference
MODULE_AUTHOR 宏,快速参考
MODULE_AUTHOR macro, Quick Reference
MODULE_DESCRIPTION 宏,快速参考
MODULE_DESCRIPTION macro, Quick Reference
MODULE_DEVICE_TABLE 宏,快速参考MODULE_DEVICE_TABLE
MODULE_DEVICE_TABLE macro, Quick Reference, MODULE_DEVICE_TABLE
module_init函数,初始化和关闭
module_init function, Initialization and Shutdown
module_param 宏,模块参数快速参考
module_param macro, Module Parameters, Quick Reference
mod_timer函数,定时器API内核定时器的实现
mod_timer function, The Timer API, The Implementation of Kernel Timers
监控,通过观察进行调试
monitoring, Debugging by Watching
内核(调试方式),通过观察进行调试
kernels (debugging by), Debugging by Watching
mremap 系统调用、使用 nopage 映射内存重新映射特定 I/O 区域
mremap system calls, Mapping Memory with nopage, Remapping Specific I/O Regions
MSR 寄存器、ioctls
MSR register, ioctls
MTU、网络设备和设备方法
MTU, network devices and, The Device Methods
多播、接口信息多播典型实现
multicasting, Interface Information, MulticastA Typical Implementation
IFF_MULTICAST 标志和接口信息
网络驱动程序、组播典型实现
IFF_MULTICAST flag and, Interface Information
network drivers, MulticastA Typical Implementation
互斥体、信号量和互斥体Linux 信号量实现
mutexes, Semaphores and Mutexes, The Linux Semaphore Implementation
初始化,Linux 信号量实现
initialization, The Linux Semaphore Implementation
互斥、并发及其管理
mutual exclusion, Concurrency and Its Management

N

名称字段(总线),总线
name field (buses), Buses
NAME 变量,输入
NAME variable, Input
命名、USB 和 Sysfs分配 IP 号
naming, USB and Sysfs, Assigning IP Numbers
IP 号码、分配 IP 号码
sysfs 目录树 (USB)、USB 和 Sysfs
IP numbers, Assigning IP Numbers
sysfs directory tree (USB), USB and Sysfs
数据项的自然对齐,数据对齐
natural alignment of data items, Data Alignment
nbtest 程序,测试 Scullpipe 驱动程序
nbtest program, Testing the Scullpipe Driver
netif_carrier_off 函数,链路状态的变化
netif_carrier_off function, Changes in Link State
netif_carrier_ok 函数,链路状态的变化
netif_carrier_ok function, Changes in Link State
netif_carrier_on 函数,链路状态的变化
netif_carrier_on function, Changes in Link State
netif_start_queue函数,打开和关闭
netif_start_queue function, Opening and Closing
netif_stop_queue函数,开启和关闭控制传输并发
netif_stop_queue function, Opening and Closing, Controlling Transmission Concurrency
netif_wake_queue函数,控制传输并发
netif_wake_queue function, Controlling Transmission Concurrency
网络轮询,网络轮询
netpoll, Netpoll
网络设备、网络
network devices, Networking
网络驱动程序、网络驱动程序snull 的设计方式数据包的物理传输连接到内核实用程序字段设备方法打开和关闭打开和关闭中断处理程序链路状态的变化MAC 地址解析非以太网标头自定义 ioctl 命令统计信息组播典型实现快速参考快速参考
network drivers, Network Drivers, How snull Is DesignedThe Physical Transport of Packets, Connecting to the KernelUtility Fields, The Device Methods, Opening and ClosingOpening and Closing, The Interrupt Handler, Changes in Link State, MAC Address ResolutionNon-Ethernet Headers, Custom ioctl Commands, Statistical Information, MulticastA Typical Implementation, Quick ReferenceQuick Reference
功能,快速参考快速参考
中断处理程序,中断处理程序
ioctl 命令、自定义 ioctl 命令
内核连接,连接到内核实用程序字段
链接状态(变化),链接状态变化
MAC 地址(解析)、MAC 地址解析非以太网标头
设备方法的方法
多播,多播典型实现
打开,打开和关闭打开和关闭
snull,snull 是如何设计的数据包的物理传输
统计、统计信息
functions, Quick ReferenceQuick Reference
interrupt handlers for, The Interrupt Handler
ioctl commands, Custom ioctl Commands
kernel connections, Connecting to the KernelUtility Fields
link state (changes in), Changes in Link State
MAC addresses (resolution of), MAC Address ResolutionNon-Ethernet Headers
methods of, The Device Methods
multicasting, MulticastA Typical Implementation
opening, Opening and ClosingOpening and Closing
snull, How snull Is DesignedThe Physical Transport of Packets
statistics, Statistical Information
网络、拆分内核拆分内核设备和模块类
networks, Splitting the Kernel, Splitting the Kernel, Classes of Devices and Modules
接口、设备类别和模块
管理,分裂内核
interfaces, Classes of Devices and Modules
management, Splitting the Kernel
net_device 结构、设备注册net_device 结构详细信息-硬件信息接口信息设备方法
net_device structure, Device Registration, The net_device Structure in DetailHardware Information, Interface Information, The Device Methods
设备方法,设备方法
接口标志,接口信息
device methods of, The Device Methods
interface flags for, Interface Information
net_device_stats结构体,初始化各个设备统计信息
net_device_stats structure, Initializing Each Device, Statistical Information
net_init.c 文件,接口信息
net_init.c file, Interface Information
next 方法,seq_file 接口
next method, The seq_file interface
非以太网标头,非以太网标头
non-Ethernet headers, Non-Ethernet Headers
非以太网接口,接口信息
non-Ethernet interfaces, Interface Information
非阻塞操作、阻塞和非阻塞操作阻塞和非阻塞操作
nonblocking operations, Blocking and Nonblocking Operations, Blocking and Nonblocking Operations
非默认属性(kobjects),非默认属性
nondefault attributes (kobjects), Nondefault Attributes
非抢占和并发,内核中的并发
nonpreemption and concurrency, Concurrency in the Kernel
不可重试的请求,不可重试的请求
nonretryable requests, Nonretryable requests
nopage 方法、vm_area_struct 结构使用 nopage 映射内存使用 nopage 映射内存重新映射特定 I/O 区域使用 nopage 方法重新映射 RAM
nopage method, The vm_area_struct structure, Mapping Memory with nopage, Mapping Memory with nopage, Remapping Specific I/O Regions, Remapping RAM with the nopage method
mremap 系统调用,使用 nopage 映射内存
防止映射扩展,重新映射特定 I/O 区域
重新映射 RAM,使用 nopage 方法重新映射 RAM
mremap system call with, Mapping Memory with nopage
preventing extension of mapping, Remapping Specific I/O Regions
remapping RAM, Remapping RAM with the nopage method
正常内存区,内存区
normal memory zone, Memory zones
通知(异步),异步通知驱动程序的观点
notification (asynchronous), Asynchronous NotificationThe Driver's Point of View
nr_frags 字段,分散/聚集 I/O
nr_frags field, Scatter/Gather I/O
NR_IRQS 符号,DIY 探测
NR_IRQS symbol, Do-it-yourself probing
NuBus,NuBus
NuBus, NuBus
NUMA(非均匀内存访问)系统、内存区域内存映射和结构页
NUMA (nonuniform memory access) systems, Memory zones, The Memory Map and Struct Page
编号、版本编号版本编号主编号和次编号主编号的动态分配打印设备编号安装中断处理程序USB 和 Sysfs物理地址和页面分配 IP 编号
numbers, Version NumberingVersion Numbering, Major and Minor NumbersDynamic Allocation of Major Numbers, Printing Device Numbers, Installing an Interrupt Handler, USB and Sysfs, Physical Addresses and Pages, Assigning IP Numbers
设备(打印),打印设备编号
中断,安装中断处理程序
IP(分配)、分配 IP 号码
主要和次要、主要和次要号码-主要号码的动态分配
PFN、物理地址和页面
根集线器 (USB)、USB 和 Sysfs
版本,版本编号版本编号
devices (printing), Printing Device Numbers
interrupt, Installing an Interrupt Handler
IP (assignment of), Assigning IP Numbers
major and minor, Major and Minor NumbersDynamic Allocation of Major Numbers
PFN, Physical Addresses and Pages
root hubs (USB), USB and Sysfs
versions, Version NumberingVersion Numbering

O

对象、并发及其管理Linux 设备模型Kobject、Kset 和子系统子系统Kobject、Kset 和子系统低级 Sysfs 操作符号链接热插拔事件生成
objects, Concurrency and Its Management, The Linux Device Model, Kobjects, Ksets, and SubsystemsSubsystems, Kobjects, Ksets, and Subsystems, Low-Level Sysfs OperationsSymbolic Links, Hotplug Event Generation
kobjects、Kobjects、Ksets 和子系统子系统Kobjects、Ksets 和子系统低级 Sysfs 操作符号链接热插拔事件生成
(另请参阅 kobjects)
热插拔事件生成,热插拔事件生成
低级 sysfs 操作,低级 Sysfs 操作符号链接
生命周期,Linux 设备模型
共享、并发及其管理
kobjects, Kobjects, Ksets, and SubsystemsSubsystems, Kobjects, Ksets, and Subsystems, Low-Level Sysfs OperationsSymbolic Links, Hotplug Event Generation
(see also kobjects)
hotplug event generation, Hotplug Event Generation
low-level sysfs operations, Low-Level Sysfs OperationsSymbolic Links
lifecycles, The Linux Device Model
sharing, Concurrency and Its Management
八位位组、网络驱动程序
octets, Network Drivers
旧的接口,旧的方式旧的接口
older interfaces, The Older Way, An older interface
字符设备注册,旧方法
/proc 文件实现,较旧的接口
char device registration, The Older Way
/proc file implementation, An older interface
oops 消息,Oops 消息Oops 消息
oops messages, Oops MessagesOops Messages
打开文件,文件结构
open files, The file Structure
open 函数(tty 驱动程序),open 和 closeopen 和 close
open function (tty drivers), open and closeopen and close
open 方法、文件操作文件结构open 方法单次打开设备限制一次访问单个用户阻止打开作为 EBUSY 的替代方案vm_area_struct 结构注册 DMA 使用打开和释放方法,设备方法
open method, File Operations, The file Structure, The open Method, Single-Open Devices, Restricting Access to a Single User at a Time, Blocking open as an Alternative to EBUSY, The vm_area_struct structure, Registering DMA usage, The open and release Methods, The Device Methods
块驱动程序,打开和释放方法
阻塞,阻塞打开作为 EBUSY 的替代方案
对于网络设备,设备方法
private_data 和文件结构
请求 DMA 通道、注册 DMA 使用
限制并发用户以及限制一次单个用户的访问
用于单开设备、单开设备
vm_operations_struct 结构、vm_area_struct 结构
block drivers, The open and release Methods
blocking, Blocking open as an Alternative to EBUSY
for network devices, The Device Methods
private_data and, The file Structure
requesting DMA channels, Registering DMA usage
restricting simultaneous users and, Restricting Access to a Single User at a Time
for single-open devices, Single-Open Devices
vm_operations_struct structure, The vm_area_struct structure
打开网络驱动程序,打开和关闭打开和关闭
opening network drivers, Opening and ClosingOpening and Closing
操作,文件操作文件操作文件操作readv 和 writev系统挂起自旋锁函数原子变量、原子变量原子变量原子变量原子变量、原子变量、原子变量原子变量原子变量位运算,位运算,位运算,位运算,位操作位操作位操作位操作阻塞和非阻塞操作阻塞和非阻塞操作字符串操作快速参考kset 上的操作、低级 Sysfs 操作符号链接热插拔操作热插拔操作总线方法vm_area_struct 结构vm_area_struct 结构vm_area_struct 结构vm_area_struct 结构vm_area_struct 结构mmap 设备操作-重新映射内核虚拟地址添加 VMA 操作异步 I/O块设备操作块设备操作- ioctl 方法分配 IP 编号设备方法tty_操作详细结构
operations, File OperationsFile Operations, File Operations, readv and writev, System Hangs, The Spinlock Functions, Atomic Variables, Atomic Variables, Atomic Variables, Atomic Variables, Atomic Variables, Atomic Variables, Atomic Variables, Atomic Variables, Atomic Variables, Bit Operations, Bit Operations, Bit Operations, Bit Operations, Bit Operations, Bit Operations, Bit Operations, Bit Operations, Blocking and Nonblocking Operations, Blocking and Nonblocking Operations, String Operations, Quick Reference, Operations on ksets, Low-Level Sysfs OperationsSymbolic Links, Hotplug Operations, Hotplug Operations, Bus methods, The vm_area_struct structure, The vm_area_struct structure, The vm_area_struct structure, The vm_area_struct structure, The vm_area_struct structure, The mmap Device OperationRemapping Kernel Virtual Addresses, Adding VMA Operations, Asynchronous I/O, Block device operations, The Block Device OperationsThe ioctl Method, Assigning IP Numbers, The Device Methods, The tty_operations Structure in Detail
aio_fsync,异步I/O
atomic_add,原子变量
atomic_dec,原子变量
atomic_dec_and_test,原子变量
atomic_inc,原子变量
atomic_inc_and_test,原子变量
atomic_read,原子变量
atomic_set,原子变量
atomic_sub,原子变量
atomic_sub_and_test,原子变量
位、位运算
块驱动程序、块设备操作块设备操作ioctl 方法
阻塞、阻塞和非阻塞操作
change_bit,位操作
clear_bit,位操作
设备,设备方法
文件,文件操作文件操作
过滤操作、热插拔操作
刷新,文件操作
热插拔、热插拔操作
ksets 上,ksets 上的操作
低级 sysfs、低级 Sysfs 操作符号链接
方法、总线方法vm_area_struct 结构vm_area_struct 结构vm_area_struct 结构vm_area_struct 结构vm_area_struct 结构
(另见方法)
巴士,巴士方法
关闭,vm_area_struct结构
无页, vm_area_struct结构
打开, vm_area_struct结构体
填充,vm_area_struct 结构
mmap 设备,mmap 设备操作重新映射内核虚拟地址
非阻塞,阻塞和非阻塞操作
设置位,位操作
snull 接口,分配 IP 号
自旋锁,自旋锁函数
字符串,字符串操作快速参考
系统请求,系统挂起
测试和更改位,位操作
测试和清除位,位操作
测试和设置位,位操作
测试位,位操作
tty_操作结构, tty_operations 结构详细信息
向量,readv 和 writev
VMA(添加),添加VMA操作
aio_fsync, Asynchronous I/O
atomic_add, Atomic Variables
atomic_dec, Atomic Variables
atomic_dec_and_test, Atomic Variables
atomic_inc, Atomic Variables
atomic_inc_and_test, Atomic Variables
atomic_read, Atomic Variables
atomic_set, Atomic Variables
atomic_sub, Atomic Variables
atomic_sub_and_test, Atomic Variables
bit, Bit Operations
block drivers, Block device operations, The Block Device Operations-The ioctl Method
blocking, Blocking and Nonblocking Operations
change_bit, Bit Operations
clear_bit, Bit Operations
devices, The Device Methods
files, File Operations-File Operations
filter operation, Hotplug Operations
flush, File Operations
hotplugs, Hotplug Operations
on ksets, Operations on ksets
low-level sysfs, Low-Level Sysfs Operations-Symbolic Links
methods, Bus methods, The vm_area_struct structure, The vm_area_struct structure, The vm_area_struct structure, The vm_area_struct structure, The vm_area_struct structure
(see also methods)
buses, Bus methods
close, The vm_area_struct structure
nopage, The vm_area_struct structure
open, The vm_area_struct structure
populate, The vm_area_struct structure
mmap devices, The mmap Device Operation-Remapping Kernel Virtual Addresses
nonblocking, Blocking and Nonblocking Operations
set_bit, Bit Operations
snull interfaces, Assigning IP Numbers
spinlocks, The Spinlock Functions
string, String Operations, Quick Reference
sysrq, System Hangs
test_and_change_bit, Bit Operations
test_and_clear_bit, Bit Operations
test_and_set_bit, Bit Operations
test_bit, Bit Operations
tty_operations structure, The tty_operations Structure in Detail
vector, readv and writev
VMAs (adding), Adding VMA Operations
优化、编译器、I/O 寄存器和常规内存
optimizations, compiler, I/O Registers and Conventional Memory
选项(配置),内核中的调试支持-内核中的调试支持
options (configuration), Debugging Support in the Kernel-Debugging Support in the Kernel
排序锁定(规则),锁定排序规则
ordering locking (rules for), Lock Ordering Rules
outb 函数,操作 I/O 端口
outb function, Manipulating I/O ports
outb_p 函数,暂停 I/O
outb_p function, Pausing I/O
outl 函数,操作 I/O 端口
outl function, Manipulating I/O ports
输出、阻塞和非阻塞操作,刷新挂起输出,与硬件通信,I/O 端口示例,示例驱动程序
output, Blocking and Nonblocking Operations, Flushing pending output, Communicating with Hardware, An I/O Port Example, A Sample Driver
缓冲区、阻塞和非阻塞操作
刷新挂起,刷新挂起输出
引脚、与硬件通信,I/O 端口示例,示例驱动程序
buffers, Blocking and Nonblocking Operations
flushing pending, Flushing pending output
pins, Communicating with Hardware, An I/O Port Example, A Sample Driver
outsb 函数,字符串操作
outsb function, String Operations
outsl 函数,字符串操作
outsl function, String Operations
outsw 函数,字符串操作
outsw function, String Operations
outw 函数,操作 I/O 端口
outw function, Manipulating I/O ports
覆盖 ARP,覆盖 ARP
overriding ARP, Overriding ARP
溢出(缓冲区),Oops 消息
overruns (buffers), Oops Messages
O_NDELAY 标志(f_flags 字段),阻塞和非阻塞操作
O_NDELAY flag (f_flags field), Blocking and Nonblocking Operations
O_NONBLOCK 标志(f_flags 字段)、文件结构,预定义命令,阻塞和非阻塞操作,从设备读取数据
O_NONBLOCK flag (f_flags field), The file Structure, The Predefined Commands, Blocking and Nonblocking Operations, Reading data from the device
读/写方法以及,从设备读取数据
read/write methods and, Reading data from the device
O_RDONLY 标志(f_flags 字段),文件结构
O_RDONLY flag (f_flags field), The file Structure
O_SYNC 标志(f_flags 字段),文件结构
O_SYNC flag (f_flags field), The file Structure

P

包、升级、版本编号
packages, upgrading, Version Numbering
数据包、拆分内核,数据包的物理传输,数据包传输-传输超时,中断处理程序,多播的内核支持
packets, Splitting the Kernel, The Physical Transport of Packets, Packet Transmission-Transmission Timeouts, The Interrupt Handler, Kernel Support for Multicasting
管理,拆分内核
多播,多播的内核支持
接收,中断处理程序
传输、数据包的物理传输,数据包传输-传输超时
management, Splitting the Kernel
multicasting, Kernel Support for Multicasting
reception, The Interrupt Handler
transmission, The Physical Transport of Packets, Packet Transmission-Transmission Timeouts
PACKET_BROADCAST 标志,重要字段
PACKET_BROADCAST flag, The Important Fields
PACKET_HOST 标志,重要字段
PACKET_HOST flag, The Important Fields
PACKET_MULTICAST 标志,重要字段
PACKET_MULTICAST flag, The Important Fields
PACKET_OTHERHOST 标志,重要字段
PACKET_OTHERHOST flag, The Important Fields
页框号 (PFN)、物理地址和页
page frame number (PFN), Physical Addresses and Pages
面向页面的分配函数,get_free_page 和 Friends,快速参考
page-oriented allocation functions, get_free_page and Friends, Quick Reference
page.h头文件,页面大小
page.h header file, Page Size
页、Oops 消息,alloc_pages 接口,使用 I/O 内存,页面大小,物理地址和页,页表,使用 nopage 映射内存
pages, Oops Messages, The alloc_pages Interface, Using I/O Memory, Page Size, Physical Addresses and Pages, Page Tables, Mapping Memory with nopage
分配器,alloc_pages 接口
无效指针引起的故障,Oops 消息
物理地址、物理地址和页
大小和可移植性,页面大小
表、使用 I/O 内存,页表,使用 nopage 映射内存
I/O 内存以及使用 I/O 内存
nopage VMA 方法,用 nopage 映射内存
allocators, The alloc_pages Interface
faults caused by invalid pointers, Oops Messages
physical addresses, Physical Addresses and Pages
size and portability, Page Size
tables, Using I/O Memory, Page Tables, Mapping Memory with nopage
I/O memory and, Using I/O Memory
nopage VMA method, Mapping Memory with nopage
page_address 宏、内存映射和结构页
page_address macro, The Memory Map and Struct Page
PAGE_SHIFT 宏,物理地址和页面
PAGE_SHIFT macro, Physical Addresses and Pages
PAGE_SHIFT 符号,页面大小
PAGE_SHIFT symbol, Page Size
PAGE_SIZE 符号,页面大小,mmap 设备操作
PAGE_SIZE symbol, Page Size, The mmap Device Operation
并行端口、内核符号表,并行端口概述-示例驱动程序,准备并行端口,禁用单个中断
parallel ports, The Kernel Symbol Table, An Overview of the Parallel Port-A Sample Driver, Preparing the Parallel Port, Disabling a single interrupt
中断处理程序,准备并行端口,禁用单个中断
禁用,禁用单个中断
准备,准备并行端口
堆叠驱动程序模块,内核符号表
interrupt handlers, Preparing the Parallel Port, Disabling a single interrupt
disabling, Disabling a single interrupt
preparing for, Preparing the Parallel Port
stacking driver modules, The Kernel Symbol Table
param.h 头文件,测量时间间隔
param.h header file, Measuring Time Lapses
参数、模块参数-模块参数,模块参数,示例驱动程序
parameters, Module Parameters-Module Parameters, Module Parameters, A Sample Driver
赋值、模块参数
基本模块,示例驱动程序
模块、模块参数-模块参数
assigning values, Module Parameters
base module, A Sample Driver
modules, Module Parameters-Module Parameters
PARENB 位掩码,set_termios
PARENB bitmask, set_termios
PARODD 位掩码,set_termios
PARODD bitmask, set_termios
部分数据传输,读方法,写方法
partial data transfers, The read Method, The write Method
读取方法,读取方法
写方法,写方法
read method, The read Method
write method, The write Method
密码、安全问题
passwords, Security Issues
暂停 I/O,暂停 I/O
pausing I/O, Pausing I/O
PC 并行接口,并行端口概述
PC parallel interface, An Overview of the Parallel Port
PCI(外围组件互连)、vmalloc 和朋友,PCI 接口-硬件抽象,回顾:ISA-即插即用规范,PC/104 和 PC/104+,MCA,EISA,VLB,SBus,NuBus,外部总线,快速参考,快速参考,添加设备,删除设备,添加驱动程序,删除驱动程序,PCI 双地址周期映射,一个简单的 PCI DMA 示例
PCI (Peripheral Component Interconnect), vmalloc and Friends, The PCI Interface-Hardware Abstractions, A Look Back: ISA-The Plug-and-Play Specification, PC/104 and PC/104+, MCA, EISA, VLB, SBus, NuBus, External Buses, Quick Reference, Quick Reference, Add a Device, Remove a Device, Add a Driver, Remove a Driver, PCI double-address cycle mappings, A simple PCI DMA example
设备、添加设备删除设备
添加,添加设备
删除、移除设备
DMA,一个简单的 PCI DMA 示例
双地址周期映射、PCI 双地址周期映射
驱动程序、添加驱动程序删除驱动程序
添加,添加驱动程序
删除、删除驱动程序
EISA,EISA
扩展总线、外部总线
接口,PCI 接口-硬件抽象
ISA,回顾:ISA-即插即用规范
列表、快速参考
MCA,MCA
NuBus,NuBus
PC/104 和 PC/104+、PC/104 和 PC/104+
SBus,SBus
搜索、快速参考
VLB, VLB
devices, Add a Device, Remove a Device
adding, Add a Device
deleting, Remove a Device
DMA, A simple PCI DMA example
double-address cycle mappings, PCI double-address cycle mappings
drivers, Add a Driver, Remove a Driver
adding, Add a Driver
deleting, Remove a Driver
EISA, EISA
extended buses, External Buses
interfaces, The PCI Interface-Hardware Abstractions
ISA, A Look Back: ISA-The Plug-and-Play Specification
lists, Quick Reference
MCA, MCA
NuBus, NuBus
PC/104 and PC/104+, PC/104 and PC/104+
SBus, SBus
searching, Quick Reference
VLB, VLB
pci_bus_type 变量,添加设备
pci_bus_type variable, Add a Device
PCI_CLASS 变量,PCI
PCI_CLASS variable, PCI
PCI_DEVICE 宏、配置寄存器和初始化
PCI_DEVICE macro, Configuration Registers and Initialization
PCI_DEVICE_CLASS 宏、配置寄存器和初始化
PCI_DEVICE_CLASS macro, Configuration Registers and Initialization
PCI_DMA_FROMDEVICE 符号,设置流 DMA 映射
PCI_DMA_FROMDEVICE symbol, Setting up streaming DMA mappings
PCI_DMA_TODEVICE 符号,设置流 DMA 映射
PCI_DMA_TODEVICE symbol, Setting up streaming DMA mappings
PCI_ID 变量,PCI
PCI_ID variable, PCI
pci_map_sg 函数,分散/聚集映射
pci_map_sg function, Scatter/gather mappings
pci_remove_bus_device 函数,删除设备
pci_remove_bus_device function, Remove a Device
pci_resource_ 函数,访问 I/O 和内存空间
pci_resource_ functions, Accessing the I/O and Memory Spaces
PCI_SLOT_NAME 变量,PCI
PCI_SLOT_NAME variable, PCI
PCI_SUBSYS_ID 变量,PCI
PCI_SUBSYS_ID variable, PCI
PDEBUG/PDEBUGG 符号,打开和关闭消息
PDEBUG/PDEBUGG symbols, Turning the Messages On and Off
待处理输出,刷新,刷新挂起输出,刷新挂起输出
pending output, flushing, Flushing pending output, Flushing pending output
每 CPU 变量、每 CPU 变量-每 CPU 变量
per-CPU variables, Per-CPU Variables-Per-CPU Variables
性能、阻塞和非阻塞操作,get_free_page 和朋友,I/O 寄存器和常规内存,字符串操作,mmap 设备操作,数据包接收
performance, Blocking and Nonblocking Operations, get_free_page and Friends, I/O Registers and Conventional Memory, String Operations, The mmap Device Operation, Packet Reception
分配套接字缓冲区、数据包接收
分配过多内存导致性能下降,get_free_page 和 Friends
内存屏障、I/O 寄存器和传统内存
mmap方法,mmap设备操作
输出缓冲区以及阻塞和非阻塞操作
字符串操作和字符串操作
allocating socket buffers, Packet Reception
degrading by allocating too much memory, get_free_page and Friends
memory barriers and, I/O Registers and Conventional Memory
mmap method, The mmap Device Operation
output buffers and, Blocking and Nonblocking Operations
string operations and, String Operations
外围组件互连、vmalloc 和朋友(请参阅 PCI)
Peripheral Component Interconnect, vmalloc and Friends (see PCI)
外设(DMA)、直接内存访问-与 DMA 控制器对话
peripherals (DMA), Direct Memory Access-Talking to the DMA controller
perror 调用,通过观察进行调试
perror calls, Debugging by Watching
内存的持久化,scull的设计
persistence of memory, The Design of scull
PFN(页框号)、物理地址和页
PFN (page frame number), Physical Addresses and Pages
pfn_to_page 函数,内存映射和结构页
pfn_to_page function, The Memory Map and Struct Page
PG_locked 标志、内存映射和结构页
PG_locked flag, The Memory Map and Struct Page
PG_reserved 标志,内存映射和结构页
PG_reserved flag, The Memory Map and Struct Page
PHYS 变量,输入
PHYS variable, Input
物理地址、地址类型,物理地址和页,物理地址和页
physical addresses, Address Types, Physical Addresses and Pages, Physical Addresses and Pages
(另见地址)
页、物理地址和页
(see also addresses)
pages, Physical Addresses and Pages
物理内存的管理,内存区域,大小参数
physical memory, management of, Memory zones, The Size Argument
(另见内存)
(see also memory)
引脚、与硬件通信,I/O 端口示例,示例驱动程序,准备并行端口,实现处理程序
pins, Communicating with Hardware, An I/O Port Example, A Sample Driver, Preparing the Parallel Port, Implementing a Handler
并行连接器的引脚 9/10,准备并行端口
中断(生成),实现处理程序
输出、与硬件通信,I/O 端口示例,示例驱动程序
9/10 of parallel connector, Preparing the Parallel Port
interrupts (generating), Implementing a Handler
output, Communicating with Hardware, An I/O Port Example, A Sample Driver
管道(scull),scull 的设计
pipes (scull), The Design of scull
平台依赖性、版本编号,平台依赖性,平台依赖性,平台依赖性,/proc 接口
platform dependency, Version Numbering, Platform Dependency, Platform Dependency, Platform Dependencies, The /proc Interface
对于模块,平台依赖性
移植和平台依赖性
/proc/stat 文件,/proc 接口
for modules, Platform Dependency
porting and, Platform Dependencies
/proc/stat file, The /proc Interface
PLIP(并行线互联网协议)、中断处理程序,覆盖 ARP
PLIP (Parallel Line Internet Protocol), The Interrupt Handler, Overriding ARP
使用以太网标头,覆盖 ARP
中断处理差异,中断处理程序
using Ethernet headers, Overriding ARP
interrupt handling differences, The Interrupt Handler
即插即用 (PnP),即插即用规范
Plug and Play (PnP), The Plug-and-Play Specification
PnP(即插即用),即插即用规范
PnP (plug and play), The Plug-and-Play Specification
点对点协议 (PPP) 和中断处理差异,中断处理程序
Point-to-Point Protocol (PPP) and interrupt handling differences, The Interrupt Handler
指针、scull 的内存使用,ioctl,指针和错误值,嵌入 kobject,tty_driver 函数指针-没有 read 函数?
pointers, scull's Memory Usage, ioctl, Pointers and Error Values, Embedding kobjects, tty_driver Function Pointers-No read Function?
数据类型可移植性、指针和错误值
ioctl 方法中的 inode,ioctl
kobject,嵌入 kobject
scull,scull 的内存使用情况
tty_driver 函数、tty_driver 函数指针-没有 read 函数?
data type portability, Pointers and Error Values
inode in ioctl method, ioctl
kobject, Embedding kobjects
scull, scull's Memory Usage
tty_driver function, tty_driver Function Pointers-No read Function?
策略、设备驱动程序的角色-设备驱动程序的角色,拆分内核,安全问题,scull 的内存使用-scull 的内存使用,不使用 ioctl 的设备控制
policies, The Role of the Device Driver-The Role of the Device Driver, Splitting the Kernel, Security Issues, scull's Memory Usage-scull's Memory Usage, Device Control Without ioctl
通过打印控制设备,不使用 ioctl 的设备控制
内存、拆分内核,scull 的内存使用情况-scull 的内存使用情况
分配(scull)、scull 的内存使用情况-scull 的内存使用情况
安全,安全问题
与机制分离,设备驱动程序的角色-设备驱动程序的角色
controlling devices by printing and, Device Control Without ioctl
memory, Splitting the Kernel, scull's Memory Usage-scull's Memory Usage
allocation (scull), scull's Memory Usage-scull's Memory Usage
security, Security Issues
separation from mechanism, The Role of the Device Driver-The Role of the Device Driver
poll 方法、文件操作,文件操作,poll 和 select-底层数据结构,设备方法
poll method, File Operations, File Operations, poll and select-The Underlying Data Structure, The Device Methods
poll.h 头文件,poll 和 select,快速参考
poll.h header file, poll and select, Quick Reference
POLLERR 标志、poll 和 select
POLLERR flag, poll and select
POLLHUP 标志、poll 和 select
POLLHUP flag, poll and select
POLLIN 标志、poll 和 select
POLLIN flag, poll and select
POLLOUT 标志、poll 和 select
POLLOUT flag, poll and select
POLLPRI 标志、poll 和 select
POLLPRI flag, poll and select
POLLRDBAND 标志、poll 和 select
POLLRDBAND flag, poll and select
POLLRDNORM 标志、poll 和 select
POLLRDNORM flag, poll and select
POLLWRBAND 标志、poll 和 select
POLLWRBAND flag, poll and select
POLLWRNORM 标志、poll 和 select
POLLWRNORM flag, poll and select
poll_controller方法,Netpoll
poll_controller method, Netpoll
poll_table 结构,poll 和 select,底层数据结构
poll_table structure, poll and select, The Underlying Data Structure
poll_table_entry 结构,底层数据结构
poll_table_entry structure, The Underlying Data Structure
poll_wait 函数,poll 和 select,快速参考
poll_wait function, poll and select, Quick Reference
池、内存池,快速参考,DMA 池
pools, Memory Pools, Quick Reference, DMA pools
DMA、DMA 池
内存、内存池,快速参考
DMA, DMA pools
memory, Memory Pools, Quick Reference
populate 方法,vm_area_struct 结构
populate method, The vm_area_struct structure
可移植性、平台依赖性
portability, Platform Dependencies
移植和平台依赖性
porting and, Platform Dependencies
端口、I/O 端口和 I/O 内存-isa_readb 和朋友,操作 I/O 端口,平台依赖性,并行端口概述-示例驱动程序,快速参考,快速参考,快速参考,准备并行端口,禁用单个中断
ports, I/O Ports and I/O Memory-isa_readb and Friends, Manipulating I/O ports, Platform Dependencies, An Overview of the Parallel Port-A Sample Driver, Quick Reference, Quick Reference, Quick Reference, Preparing the Parallel Port, Disabling a single interrupt
(另请参见连接;并行端口)
访问、快速参考
访问不同大小,操作 I/O 端口
I/O、I/O 端口和 I/O 内存-isa_readb 和朋友,快速参考
并行,并行端口概述-示例驱动程序,准备并行端口,禁用单个中断
禁用中断处理程序,禁用单个中断
准备中断处理程序,准备并行端口
平台依赖性和,平台依赖性
(see also connections; parallel ports)
access, Quick Reference
accessing different sizes, Manipulating I/O ports
I/O, I/O Ports and I/O Memory-isa_readb and Friends, Quick Reference
parallel, An Overview of the Parallel Port-A Sample Driver, Preparing the Parallel Port, Disabling a single interrupt
disabling interrupt handlers, Disabling a single interrupt
preparing for interrupt handlers, Preparing the Parallel Port
platform dependency and, Platform Dependencies
POS(可编程选项选择)、MCA
POS (Programmable Option Select), MCA
电源管理,Linux 设备模型
power management, The Linux Device Model
PowerPC 架构(移植和)、平台依赖性
PowerPC architecture (porting and), Platform Dependencies
PPP(点对点协议)和中断处理差异,中断处理程序
PPP (Point-to-Point Protocol) and interrupt handling differences, The Interrupt Handler
pread方法,读写
pread method, read and write
精度、时间、了解当前时间
precision, temporal, Knowing the Current Time
预定义命令 ioctl 方法,预定义命令,预定义命令
predefined commands ioctl method, The Predefined Commands, The Predefined Commands
(另请参阅命令)
(see also commands)
抢占和并发,内核中的并发
preemption and concurrency, Concurrency in the Kernel
打印、通过打印进行调试-打印设备编号,打开和关闭消息,打印设备编号,使用 gdb,不使用 ioctl 的设备控制,接口特定类型,接口特定类型
printing, Debugging by Printing-Printing Device Numbers, Turning the Messages On and Off, Printing Device Numbers, Using gdb, Device Control Without ioctl, Interface-Specific Types, Interface-Specific Types
控制设备,不使用 ioctl 的设备控制
调试代码,打开和关闭消息
设备编号、打印设备编号
来自 gdb 调试器,使用 gdb
接口特定数据、接口特定类型
内核,通过打印进行调试-打印设备编号
_t 数据项,接口特定类型
controlling devices by, Device Control Without ioctl
to debug code, Turning the Messages On and Off
device numbers, Printing Device Numbers
from gdb debugger, Using gdb
interface-specific data, Interface-Specific Types
kernels, Debugging by Printing-Printing Device Numbers
_t data items, Interface-Specific Types
printk 函数、Hello World 模块,printk-打印设备编号,如何记录消息,如何记录消息,如何记录消息,打开和关闭消息,seq_file 接口
printk function, The Hello World Module, printk-Printing Device Numbers, How Messages Get Logged, How Messages Get Logged, How Messages Get Logged, Turning the Messages On and Off, The seq_file interface
循环缓冲区,用于如何记录消息
使用如何记录消息进行调试
记录消息,消息如何记录
seq_file 接口(避免 in), seq_file 接口
打开/关闭调试消息,打开和关闭消息
circular buffers for, How Messages Get Logged
debugging with, How Messages Get Logged
logging messages from, How Messages Get Logged
seq_file interface (avoiding in), The seq_file interface
turning debug messages on/off, Turning the Messages On and Off
优先级、printk,分配内存,标志参数
priorities, printk, Allocating Memory, The Flags Argument
分配、分配内存,标志参数
内存、分配内存
allocation, Allocating Memory, The Flags Argument
memory, Allocating Memory
private_data字段(文件结构),文件结构
private_data field (file structure), The file Structure
特权操作、能力和受限操作
privileged operations, Capabilities and Restricted Operations
探测功能(USB)、探测和断开详细信息
probe function (USB), probe and disconnect in Detail
探测,动态,动态探测
Probes, Dynamic, Dynamic Probes
probe_irq_off函数,内核辅助探测
probe_irq_off function, Kernel-assisted probing
probe_irq_on 函数,内核辅助探测
probe_irq_on function, Kernel-assisted probing
探测、自动检测 IRQ 编号自动检测 IRQ 编号内核辅助探测DIY 探测旧式 PCI 探测
probing, Autodetecting the IRQ Number, Autodetecting the IRQ Number, Kernel-assisted probing, Do-it-yourself probing, Old-Style PCI Probing
自己动手,自己动手探索
对于 IRQ 编号,自动检测 IRQ 编号
内核辅助的,内核辅助探测
PCI,旧式 PCI 探测
do-it-yourself, Do-it-yourself probing
for IRQ numbers, Autodetecting the IRQ Number
kernel-assisted, Kernel-assisted probing
PCI, Old-Style PCI Probing
/proc 文件系统,创建 /proc 文件-seq_file 接口,创建 /proc 文件,/proc 接口,/proc 接口和共享中断
/proc filesystem, Creating your /proc file-The seq_file interface, Creating your /proc file, The /proc Interface, The /proc Interface and Shared Interrupts
安装中断处理程序,/proc 接口
删除 /proc 条目,创建 /proc 文件
共享中断以及 /proc 接口和共享中断
installing interrupt handlers, The /proc Interface
removing /proc entries, Creating your /proc file
shared interrupts and, The /proc Interface and Shared Interrupts
/proc/*/maps,虚拟内存区域
/proc/*/maps, Virtual Memory Areas
/proc/devices 文件,主号码动态分配
/proc/devices file, Dynamic Allocation of Major Numbers
/proc/interrupts 文件、/proc 接口/proc 接口和共享中断
/proc/interrupts file, The /proc Interface, The /proc Interface and Shared Interrupts
/proc/kcore 文件,使用 gdb
/proc/kcore file, Using gdb
/proc/kmsg 文件,如何记录消息
/proc/kmsg file, How Messages Get Logged
/proc/modules 文件,快速参考
/proc/modules file, Quick Reference
/proc/slabinfo 文件,Lookaside 缓存
/proc/slabinfo file, Lookaside Caches
/proc/stat 文件,/proc 接口
/proc/stat file, The /proc Interface
/proc/sys/kernel/printk 文件,用于读取控制台日志级别,printk
/proc/sys/kernel/printk file, reading console loglevel with, printk
/proc/tty/driver/ 目录,TTY 驱动程序
/proc/tty/driver/ directory, TTY Drivers
进程、拆分内核-可加载模块,拆分内核,当前进程,阻塞 I/O-测试 Scullpipe 驱动程序,设备文件的访问控制,单次打开设备,内核定时器的实现,进程内存映射
processes, Splitting the Kernel-Loadable Modules, Splitting the Kernel, The Current Process, Blocking I/O-Testing the Scullpipe Driver, Access Control on a Device File, Single-Open Devices, The Implementation of Kernel Timers, The Process Memory Map
当前,当前进程
内核定时器,内核定时器的实现
内核(拆分),拆分内核-可加载模块
登录、设备文件访问控制
管理、拆分内核
内存映射,进程内存映射
每个进程的打开设备,单打开设备
睡眠,阻塞 I/O-测试 Scullpipe 驱动程序
current, The Current Process
kernel timers for, The Implementation of Kernel Timers
kernels (splitting), Splitting the Kernel-Loadable Modules
login, Access Control on a Device File
managing, Splitting the Kernel
memory maps, The Process Memory Map
opening devices for each process, Single-Open Devices
sleeps, Blocking I/O-Testing the Scullpipe Driver
处理器特定寄存器,处理器特定寄存器
processor-specific registers, Processor-Specific Registers
proc_read方法,实现/proc中的文件
proc_read method, Implementing files in /proc
产品变量、输入USB
PRODUCT variable, Input, USB
可编程选项选择 (POS)、MCA
Programmable Option Select (POS), MCA
编程、设置测试系统,Hello World 模块-Hello World 模块,用户空间和内核空间,内核中的并发性,预备知识,在用户空间中进行操作-在用户空间中进行操作,ISA 编程
programming, Setting Up Your Test System, The Hello World Module-The Hello World Module, User Space and Kernel Space, Concurrency in the Kernel, Preliminaries, Doing It in User Space-Doing It in User Space, ISA Programming
并发性,内核中的并发性
hello world 模块,Hello World 模块-Hello World 模块
ISA、ISA 编程
模块要求,预备知识
测试系统设置,设置您的测试系统
用户空间、用户空间和内核空间,在用户空间中进行操作-在用户空间中进行操作
concurrency in, Concurrency in the Kernel
hello world module, The Hello World Module-The Hello World Module
ISA, ISA Programming
module requirements, Preliminaries
test system setup, Setting Up Your Test System
user space, User Space and Kernel Space, Doing It in User Space-Doing It in User Space
程序、设备驱动程序的角色,设备驱动程序的角色,可加载模块,可加载模块,许可条款,内核模块与应用程序,重定向控制台消息,Linux 跟踪工具包,不使用 ioctl 的设备控制,测试 Scullpipe 驱动程序,异步通知,忙等待,标准 C 类型的使用,数据对齐,/sbin/hotplug 实用程序,重新映射 RAM,数据包的物理传输
programs, The Role of the Device Driver, The Role of the Device Driver, Loadable Modules, Loadable Modules, License Terms, Kernel Modules Versus Applications, Redirecting Console Messages, The Linux Trace Toolkit, Device Control Without ioctl, Testing the Scullpipe Driver, Asynchronous Notification, Busy waiting, Use of Standard C Types, Data Alignment, The /sbin/hotplug Utility, Remapping RAM, The Physical Transport of Packets
(另请参见应用程序与内核模块)
asynctest,异步通知
dataalign,数据对齐
datasize,标准 C 类型的使用
insmod,可加载模块
jitbusy,忙等待
mapper,重新映射 RAM
nbtest,测试 Scullpipe 驱动程序
获取、许可条款
rmmod,可加载模块
/sbin/hotplug 实用程序,/sbin/hotplug 实用程序
setconsole,重定向控制台消息
setterm,不使用 ioctl 的设备控制
tcpdump,数据包的物理传输
跟踪,Linux 跟踪工具包
tunelp,设备驱动程序的作用
(see also applications versus kernel modules)
asynctest, Asynchronous Notification
dataalign, Data Alignment
datasize, Use of Standard C Types
insmod, Loadable Modules
jitbusy, Busy waiting
mapper, Remapping RAM
nbtest, Testing the Scullpipe Driver
obtaining, License Terms
rmmod, Loadable Modules
/sbin/hotplug utility, The /sbin/hotplug Utility
setconsole, Redirecting Console Messages
setterm, Device Control Without ioctl
tcpdump, The Physical Transport of Packets
tracing, The Linux Trace Toolkit
tunelp, The Role of the Device Driver
公共内核符号,内核符号表-内核符号表
public kernel symbols, The Kernel Symbol Table-The Kernel Symbol Table
put_unaligned函数,数据对齐
put_unaligned function, Data Alignment
put_user 函数,使用 ioctl 参数快速参考
put_user function, Using the ioctl Argument, Quick Reference
pwrite方法,读写
pwrite method, read and write

Q

量子/量子集(内存),scull 的内存使用情况
quantums/quantum sets (memory), scull's Memory Usage
查询内核,通过查询进行调试-ioctl 方法
querying kernels, Debugging by Querying-The ioctl Method
查询调试,ioctl 方法
querying to debug, The ioctl Method
队列、休眠简介,手动休眠,快速参考,工作队列-共享队列,工作队列,工作队列,请求方法简介,请求队列,队列创建和删除,队列函数,队列控制函数,标记命令队列-标记命令队列,打开和关闭,控制传输并发
queues, Introduction to Sleeping, Manual sleeps, Quick Reference, Workqueues-The Shared Queue, Workqueues, Workqueues, Introduction to the request Method, Request Queues, Queue creation and deletion, Queueing functions, Queue control functions, Tagged Command Queueing-Tagged Command Queueing, Opening and Closing, Controlling Transmission Concurrency
控制功能、队列控制功能
创建/删除,队列创建和删除
功能、排队功能
网络驱动程序、打开和关闭
请求函数,请求方法简介
请求方法、请求队列
TCQ,标记命令队列-标记命令队列
传输,控制传输并发
等待,休眠简介,手动休眠,快速参考
工作队列、工作队列-共享队列,工作队列,工作队列
control functions, Queue control functions
creating/deleting, Queue creation and deletion
functions, Queueing functions
network drivers, Opening and Closing
request function, Introduction to the request Method
request method, Request Queues
TCQ, Tagged Command Queueing-Tagged Command Queueing
transmissions, Controlling Transmission Concurrency
wait, Introduction to Sleeping, Manual sleeps, Quick Reference
workqueues, Workqueues-The Shared Queue, Workqueues, Workqueues

R

竞争条件、内核中的并发,模块加载竞争,scull 中的陷阱,内核定时器
race conditions, Concurrency in the Kernel, Module-Loading Races, Pitfalls in scull, Kernel Timers
内核定时器和内核定时器
模块加载,模块加载竞赛
序列,scull 中的陷阱
kernel timers and, Kernel Timers
module loading, Module-Loading Races
sequences, Pitfalls in scull
RAM(随机存取存储器)、I/O 寄存器和传统存储器,重新映射 RAM
RAM (random access memory), I/O Registers and Conventional Memory, Remapping RAM
重新映射,重新映射 RAM
与 I/O 寄存器、I/O 寄存器和传统存储器的比较
remapping, Remapping RAM
versus I/O registers, I/O Registers and Conventional Memory
随机存取存储器,重新映射 RAM(请参阅 RAM)
random access memory, Remapping RAM (see RAM)
随机数,安装中断处理程序
random numbers, Installing an Interrupt Handler
速率、限制、速率限制
rates, limitations of, Rate Limiting
RCU(读取-复制-更新),读取-复制-更新
RCU (read-copy-update), Read-Copy-Update
rdtscl 函数,处理器特定寄存器
rdtscl function, Processor-Specific Registers
read 函数(tty 驱动程序),没有 read 函数?
read function (tty drivers), No read Function?
读取方法、文件操作,文件结构,读取和写入,读取方法,读取方法,通过观察进行调试,Oops 消息,从设备读取数据,与 DMA 控制器对话
read method, File Operations, The file Structure, read and write, The read Method, The read Method, Debugging by Watching, Oops Messages, Reading data from the device, Talking to the DMA controller
参数、读取和写入
代码,读取方法
配置 DMA 控制器、与 DMA 控制器对话
f_pos 字段(文件结构)和,文件结构
oops 消息,Oops 消息
poll 方法以及,从设备读取数据
返回值,解释规则,读取方法
strace 命令和,通过观察进行调试
arguments to, read and write
code for, The read Method
configuring DMA controllers, Talking to the DMA controller
f_pos field (file structure) and, The file Structure
oops messages, Oops Messages
poll method and, Reading data from the device
return values, rules for interpreting, The read Method
strace command and, Debugging by Watching
读-复制-更新 (RCU),读-复制-更新
read-copy-update (RCU), Read-Copy-Update
只读 /proc 文件,创建,在 /proc 中实现文件
read-only /proc files, creating, Implementing files in /proc
读/写指令、重新排序、I/O 寄存器和传统存储器
read/write instructions, reordering, I/O Registers and Conventional Memory
读/写位置、更改、文件操作
read/write position, changing, File Operations
readdir方法,文件操作
readdir method, File Operations
读取器/写入器信号量、读取器/写入器信号量
reader/writer semaphores, Reader/Writer Semaphores
读取器/写入器自旋锁、读取器/写入器自旋锁
reader/writer spinlocks, Reader/Writer Spinlocks
读、阻塞和非阻塞操作
reading, Blocking and Nonblocking Operations
阻塞/非阻塞操作、阻塞和非阻塞操作
blocking/nonblocking operations, Blocking and Nonblocking Operations
readv 调用、readv 和 writev
readv calls, readv and writev
readv方法,文件操作
readv method, File Operations
read_proc函数,实现/proc中的文件
read_proc function, Implementing files in /proc
rebuild_header 方法,设备方法
rebuild_header method, The Device Methods
数据包的接收、数据包的物理传输,数据包的物理传输
reception of packets, The Physical Transport of Packets, The Physical Transport of Packets
恢复、错误、初始化期间的错误处理
recovery, error, Error Handling During Initialization
重定向控制台消息,重定向控制台消息
redirecting console messages, Redirecting Console Messages
可重入、内核中的并发,系统挂起
reentrant, Concurrency in the Kernel, System Hangs
呼叫,系统挂起
代码,内核中的并发
calls, System Hangs
code, Concurrency in the Kernel
引用计数器 (kobjects)、引用计数操作
reference counters (kobjects), Reference count manipulation
区域、访问 I/O 和内存空间,重新映射特定 I/O 区域
regions, Accessing the I/O and Memory Spaces, Remapping Specific I/O Regions
通用 I/O 地址空间,访问 I/O 和内存空间
I/O内存管理,重新映射特定I/O区域
generic I/O address spaces, Accessing the I/O and Memory Spaces
I/O memory management, Remapping Specific I/O Regions
寄存器,处理器特定寄存器,处理器特定寄存器,I/O 寄存器和传统存储器,配置寄存器和初始化,配置寄存器和初始化,配置寄存器和初始化,配置寄存器和初始化,配置寄存器和初始化,配置寄存器和初始化,快速参考,DMA 映射,分散/聚集映射,分散/聚集映射,ioctls,ioctls
registers, Processor-Specific Registers, Processor-Specific Registers, I/O Registers and Conventional Memory, Configuration Registers and Initialization, Configuration Registers and Initialization, Configuration Registers and Initialization, Configuration Registers and Initialization, Configuration Registers and Initialization, Configuration Registers and Initialization, Quick Reference, DMA mappings, Scatter/gather mappings, Scatter/gather mappings, ioctls, ioctls
计数器、处理器特定寄存器
I/O、I/O 寄存器和常规存储器
LSR,ioctls
映射、DMA 映射,分散/聚集映射
MSR,ioctls
PCI,配置寄存器和初始化,配置寄存器和初始化,配置寄存器和初始化,配置寄存器和初始化,配置寄存器和初始化,配置寄存器和初始化,快速参考
类、配置寄存器和初始化
deviceID、配置寄存器和初始化
子系统 deviceID、配置寄存器和初始化
子系统 vendorID、配置寄存器和初始化
vendorID、配置寄存器和初始化
处理器特定的,处理器特定寄存器
分散列表(和映射)、分散/聚集映射
counters, Processor-Specific Registers
I/O, I/O Registers and Conventional Memory
LSR, ioctls
mapping, DMA mappings, Scatter/gather mappings
MSR, ioctls
PCI, Configuration Registers and Initialization, Configuration Registers and Initialization, Configuration Registers and Initialization, Configuration Registers and Initialization, Configuration Registers and Initialization, Configuration Registers and Initialization, Quick Reference
class, Configuration Registers and Initialization
deviceID, Configuration Registers and Initialization
subsystem deviceID, Configuration Registers and Initialization
subsystem vendorID, Configuration Registers and Initialization
vendorID, Configuration Registers and Initialization
processor-specific, Processor-Specific Registers
scatterlists (and mapping), Scatter/gather mappings
register_blkdev函数,块驱动注册
register_blkdev function, Block Driver Registration
register_chrdev 函数,udev
register_chrdev function, udev
register_netdev函数,初始化每个设备
register_netdev function, Initializing Each Device
注册、清理函数,初始化期间的错误处理,模块加载竞争,Char 设备注册-旧方法,快速参考,注册 PCI 驱动程序,注册 USB 驱动程序,注册 USB 驱动程序,总线注册,设备注册,注册 DMA 用法,注册-关于扇区大小的说明,磁盘注册,设备注册,小型 TTY 驱动程序,struct termios
registration, The Cleanup Function, Error Handling During Initialization, Module-Loading Races, Char Device Registration-The Older Way, Quick Reference, Registering a PCI Driver, Registering a USB Driver, Registering a USB Driver, Bus registration, Device registration, Registering DMA usage, Registration-A Note on Sector Sizes, Disk Registration, Device Registration, A Small TTY Driver, struct termios
块驱动程序、注册-关于扇区大小的说明
总线,总线注册
char 驱动程序、Char 设备注册-旧方法
清理函数,清理函数
设备、设备注册,设备注册
磁盘、磁盘注册
DMA 使用、注册 DMA 使用
中断处理程序,快速参考
模块加载竞赛,模块加载竞赛
PCI 驱动程序,注册 PCI 驱动程序
struct usb_driver 结构体,注册 USB 驱动程序
tiny_tty_driver 变量,struct termios
跟踪,初始化期间的错误处理
tty 驱动程序,小型 TTY 驱动程序
USB 驱动程序,注册 USB 驱动程序
block drivers, Registration-A Note on Sector Sizes
buses, Bus registration
char drivers, Char Device Registration-The Older Way
cleanup function, The Cleanup Function
devices, Device registration, Device Registration
disks, Disk Registration
DMA usage, Registering DMA usage
interrupt handlers, Quick Reference
module-loading races, Module-Loading Races
PCI drivers, Registering a PCI Driver
struct usb_driver structure, Registering a USB Driver
tiny_tty_driver variable, struct termios
tracking, Error Handling During Initialization
tty drivers, A Small TTY Driver
USB drivers, Registering a USB Driver
释放电话、单开设备
release calls, Single-Open Devices
释放函数(kobject)、释放函数和 kobject 类型
release functions (kobjects), Release functions and kobject types
释放方法、文件操作,释放方法,释放方法,释放方法,阻塞打开作为 EBUSY 的替代方案,在打开时克隆设备,释放函数和 kobject 类型,打开和释放方法
release method, File Operations, The release Method, The release Method, The release Method, Blocking open as an Alternative to EBUSY, Cloning the Device on open, Release functions and kobject types, The open and release Methods
块驱动程序,打开和释放方法
阻塞,阻塞打开作为 EBUSY 的替代方案
克隆设备,在打开时克隆设备
kobject、Release 函数和 kobject 类型
block drivers, The open and release Methods
blocking, Blocking open as an Alternative to EBUSY
cloning devices, Cloning the Device on open
kobjects, Release functions and kobject types
release_dma_lock函数,与DMA控制器对话
release_dma_lock function, Talking to the DMA controller
释放自旋锁,自旋锁函数
releasing spinlocks, The Spinlock Functions
RELEVANT_IFLAG 宏、set_termios
RELEVANT_IFLAG macro, set_termios
重新映射、重新映射 RAM重新映射 RAM重新映射内核虚拟地址
remapping, Remapping RAM, Remapping RAM, Remapping Kernel Virtual Addresses
(另请参阅映射)
内核虚拟地址,重新映射内核虚拟地址
RAM、重新映射 RAM
(see also mapping)
kernel virtual addresses, Remapping Kernel Virtual Addresses
RAM, Remapping RAM
remap_pfn_range函数,使用remap_pfn_range
remap_pfn_range function, Using remap_pfn_range
Remote0(IP 号码),分配 IP 号码
remote0 (IP number), Assigning IP Numbers
可移动媒体(支持),支持可移动媒体
removable media (supporting), Supporting Removable Media
remove_proc_entry 函数,创建 /proc 文件
remove_proc_entry function, Creating your /proc file
重新排序读/写指令、I/O 寄存器和传统内存
reordering read/write instructions, I/O Registers and Conventional Memory
报告(中断),安装中断处理程序
reports (interrupts), Installing an Interrupt Handler
请求,阻塞打开作为 EBUSY 的替代方案请求处理无需请求队列bio 结构
requests, Blocking open as an Alternative to EBUSY, Request ProcessingDoing without a request queue, The bio structure
阻塞,阻塞打开作为 EBUSY 的替代方案
处理,请求处理没有请求队列
(处理)状态,生物结构
blocking, Blocking open as an Alternative to EBUSY
processing, Request ProcessingDoing without a request queue
state of (processing), The bio structure
request_dma function, Registering DMA usage
request_firmware function, The Kernel Firmware Interface
requeuing/rescheduling tasks, Kernel Timers
requirements, code, Preliminaries
resolution of time, Knowing the Current Time
resolving Ethernet addresses, MAC Address Resolution
resource flags (PCI), Accessing the I/O and Memory Spaces
restriction of access, Restricting Access to a Single User at a Time
retrieval of current time, Knowing the Current Time–Knowing the Current Time
return values, The Return Value, Handler Arguments and Return Value
interrupt handlers, Handler Arguments and Return Value
switch statements, The Return Value
revalidate method, Supporting Removable Media
ring buffers (DMA), Overview of a DMA Data Transfer
RISC processor and inline assembly code, Processor-Specific Registers
rmmod program, Loadable Modules, Loadable Modules, The Hello World Module, The Hello World Module, Dynamic Allocation of Major Numbers
dynamically allocating major numbers, Dynamic Allocation of Major Numbers
testing modules using, The Hello World Module
roles, The Role of the Device Driver–The Role of the Device Driver, Splitting the Kernel–Loadable Modules
of device drivers, The Role of the Device Driver–The Role of the Device Driver
kernels, Splitting the Kernel–Loadable Modules
root hubs (USB), USB and Sysfs
routing, network management, Splitting the Kernel
rq_data_dir field (request structure), A Simple request Method
rules, Ambiguous Rules, Lock Ordering Rules
locking, Ambiguous Rules
ordering, Lock Ordering Rules
running, Installing a Shared Handler (see execution)
runtime, code, Loadable Modules
rwsems (reader/writer semaphores), Reader/Writer Semaphores

S

S/390 architecture, Platform Dependencies, S/390 and zSeries
porting and, Platform Dependencies
SAK (secure attention key) function, System Hangs
sample programs, obtaining, License Terms
SA_INTERRUPT flag, Installing an Interrupt Handler, Quick Reference
SA_SAMPLE_RANDOM flag, Installing an Interrupt Handler, Quick Reference
SA_SHIRQ flag, Installing an Interrupt Handler, Installing a Shared Handler, Quick Reference
/sbin/hotplug utility, The /sbin/hotplug Utility
sbull drivers, Initialization in sbull, A Simple request Method
initialization, Initialization in sbull
request method, A Simple request Method
sbull ioctl method, The ioctl Method
sbull_request function, Initialization in sbull
SBus, SBus
scatter/gather, Scatter/gather mappings, Scatter/Gather I/O
DMA mappings, Scatter/gather mappings
I/O, Scatter/Gather I/O
scatterlists, Scatter/gather mappings, Scatter/gather mappings, Direct Memory Access
mapping, Scatter/gather mappings, Scatter/gather mappings
structure, Direct Memory Access
sched.h header file, Quick Reference, Measuring Time Lapses
schedule function, System Hangs, Quick Reference, Yielding the processor
execution of code (delaying), Yielding the processor
preventing endless loops with, System Hangs
schedulers (I/O), Request Queues
schedule_timeout function, Timeouts
scheduling kernel timers, Kernel Timers–The Implementation of Kernel Timers
scripts (hotplug), Linux hotplug scripts
SCSI, Classes of Devices and Modules, SCSI
devices, SCSI
modules, Classes of Devices and Modules
scull, Char Drivers, The Design of scull, Dynamic Allocation of Major Numbers, File Operations–File Operations, The inode Structure, Device Registration in scull, The open Method–The open Method, The release Method, scull's Memory Usage–scull's Memory Usage, scull's Memory Usage, read and write, read and write, readv and writev, readv and writev, Playing with the New Devices, Turning the Messages On and Off, Implementing files in /proc, The seq_file interface, The seq_file interface, The seq_file interface, Pitfalls in scull, Pitfalls in scull, Concurrency and Its Management, Semaphores and Mutexes, Using Semaphores in scull, Choosing the ioctl Commands
char drivers, Playing with the New Devices
concurrency, Concurrency and Its Management (see concurrency)
design of, The Design of scull
device registration, Device Registration in scull
drivers (example), Turning the Messages On and Off, Choosing the ioctl Commands
file operations, File Operations–File Operations
inode structure, The inode Structure
locking (adding), Semaphores and Mutexes
memory, scull's Memory Usage–scull's Memory Usage, Pitfalls in scull
troubleshooting, Pitfalls in scull
usage, scull's Memory Usage–scull's Memory Usage
next method, The seq_file interface
open method, The open Method–The open Method
pointers, scull's Memory Usage
race conditions, Pitfalls in scull
read method, read and write
readv calls, readv and writev
read_proc method, Implementing files in /proc
release method, The release Method
semaphores, Using Semaphores in scull
show method, The seq_file interface
stop method, The seq_file interface
write method, read and write
writev calls, readv and writev
scull driver (example), Char Drivers
scullc driver (example), A scull Based on the Slab Caches: scullc
scullp, A scull Using Whole Pages: scullp, Remapping RAM with the nopage method
example, A scull Using Whole Pages: scullp
mmap implementations, Remapping RAM with the nopage method
scullpipe devices (example), A Blocking I/O Example–Testing the Scullpipe Driver
scullsingle device, Single-Open Devices
sculluid code, Restricting Access to a Single User at a Time
scullv driver (example), A scull Using Virtual Addresses: scullv, Quick Reference
scull_cleanup function, Cloning the Device on open
scull_getwritespace function, Manual sleeps
searching PCI drivers, Quick Reference
sectors (size of), A Note on Sector Sizes
sector_t bi_sector field (bio structure), The bio structure
sector_t capacity field (gendisk), The gendisk structure
sector_t sector field (request structure), A Simple request Method
secure attention key (SAK) function, System Hangs
security, Security Issues, Security Issues
seeking devices, Seeking a Device
select method, File Operations, poll and select–The Underlying Data Structure
poll method and, File Operations
semaphores, Semaphores and Mutexes, Semaphores and Mutexes, The Linux Semaphore Implementation–Reader/Writer Semaphores, Reader/Writer Semaphores, Completions–Completions
completion, Completions–Completions
implementation, The Linux Semaphore Implementation–Reader/Writer Semaphores
reader/writer, Reader/Writer Semaphores
unlocking, Semaphores and Mutexes
sendfile system, File Operations
sendpage system, File Operations
seqlocks, seqlocks
SEQNUM variable, The /sbin/hotplug Utility
sequences (race conditions), Pitfalls in scull
seq_file interface, The seq_file interface–The seq_file interface
serial line configuration, ioctls
serial_icounter_struct structure, ioctls
setconsole program, Redirecting Console Messages
setterm program, Device Control Without ioctl
set_bit operation, Bit Operations
set_config method, The Device Methods
set_dma_addr function, Talking to the DMA controller
set_dma_count function, Talking to the DMA controller
set_dma_mode function, Talking to the DMA controller
set_mac_address method, The Device Methods
set_mb function, I/O Registers and Conventional Memory
set_multicast_list function, A Typical Implementation
set_multicast_list method, Interface Information, The Device Methods
set_rmb function, I/O Registers and Conventional Memory
set_termios function, set_termios
set_wmb function, I/O Registers and Conventional Memory
sfile argument, The seq_file interface
sg_dma_address function, Direct Memory Access
sg_dma_address macro, Scatter/gather mappings
sg_dma_len function, Direct Memory Access
sg_dma_len macro, Scatter/gather mappings
sharing, Concurrency and Its Management, The Shared Queue, Interrupt Sharing–The /proc Interface and Shared Interrupts
code, Concurrency and Its Management
interrupt handlers, Interrupt Sharing–The /proc Interface and Shared Interrupts
queues, The Shared Queue
short delays, Short Delays
sleeps, Short Delays
short driver (example), A Sample Driver, Reusing short for I/O Memory, Installing an Interrupt Handler, Do-it-yourself probing, Implementing a Handler
accessing I/O memory, Reusing short for I/O Memory
implementing interrupt handlers, Implementing a Handler
installing interrupt handlers, Installing an Interrupt Handler
probing, Do-it-yourself probing
short module, Kernel-assisted probing
shortprint drivers, A Write-Buffering Example–A Write-Buffering Example
show function, Driver structure embedding
show method, The seq_file interface, Default Attributes
kobjects, Default Attributes
seq_file interface, The seq_file interface
shutdown, Initialization and Shutdown, The Linux Device Model
SIGIO signal, Asynchronous Notification
signal handling, A Blocking I/O Example
Simple Character Utility for Loading Localities, Char Drivers (see scull)
Simple Hardware Operations and Raw Tests, A Sample Driver (see short driver)
simple sleeping, Simple Sleeping
single-open devices, Single-Open Devices
single-page streaming mappings, Single-page streaming mappings
SIOCDEVPRIVATE commands, Custom ioctl Commands
SIOCSIFADDR command, Custom ioctl Commands
SIOCSIFMAP command, Custom ioctl Commands
size, The Size Argument, Manipulating I/O ports, Assigning an Explicit Size to Data Items, Assigning an Explicit Size to Data Items, Page Size, A Note on Sector Sizes
data explicitly, Assigning an Explicit Size to Data Items
explicit, Assigning an Explicit Size to Data Items
kmalloc argument, The Size Argument
pages, Page Size
ports, Manipulating I/O ports
of sectors, A Note on Sector Sizes
skbuff.h header file, Packet Transmission
skb_headlen function, Functions Acting on Socket Buffers
skb_headroom function, Functions Acting on Socket Buffers
skb_is_nonlinear function, Functions Acting on Socket Buffers
skb_pull function, Functions Acting on Socket Buffers
skb_push function, Functions Acting on Socket Buffers
skb_put function, Functions Acting on Socket Buffers
skb_reserve function, Functions Acting on Socket Buffers
skb_tailroom function, Functions Acting on Socket Buffers
sk_buff structure, Packet Transmission, The Important Fields
fields for, The Important Fields
transmitting packets, Packet Transmission
SLAB_CACHE_DMA flag, Lookaside Caches
SLAB_CTOR_ATOMIC flag, Lookaside Caches
SLAB_CTOR_CONSTRUCTOR flag, Lookaside Caches
SLAB_HWCACHE_ALIGN flag, Lookaside Caches
SLAB_NO_REAP flag, Lookaside Caches
sleeps, Semaphores and Mutexes, Spinlocks and Atomic Context, Blocking I/O–Testing the Scullpipe Driver, Manual sleeps, Short Delays
locking, Semaphores and Mutexes
manual, Manual sleeps
processes, Blocking I/O–Testing the Scullpipe Driver
short delays, Short Delays
spinlocks, Spinlocks and Atomic Context
sleep_on function, Ancient history: sleep_on
slow downs (avoiding), Debugging by Querying
slow interrupt handlers, Fast and Slow Handlers
SMP (symmetric multiprocessor) systems, Concurrency in the Kernel
snullnet0 (IP number), Assigning IP Numbers
socket buffers, Packet Transmission, Packet Reception, The Socket Buffers–Functions Acting on Socket Buffers
allocation, Packet Reception
software, Version Numbering, Short Delays, Short Delays
(see also applications versus kernel modules)
loops, Short Delays
versions, Short Delays (see versions, numbering)
software-mapped I/O memory (ioremap function), I/O Memory Allocation and Mapping
SPARC architecture, Platform Dependencies
SPARC64 platform (data alignment), Data Alignment
special files, Major and Minor Numbers
spinlocks, The Spinlock Functions, Talking to the DMA controller, Utility Fields, Controlling Transmission Concurrency
dma_spin_lock, Talking to the DMA controller
hard_start_xmit function, Controlling Transmission Concurrency
releasing, The Spinlock Functions
xmit_lock function, Utility Fields
splitting kernels, Splitting the Kernel–Loadable Modules
stacking modules, The Kernel Symbol Table, The Kernel Symbol Table
standard C data types, Use of Standard C Types
start method, The seq_file interface
stat file, The /proc Interface
state of request processing, The bio structure
statements, Error Handling During Initialization, printk, ioctl, The Return Value
goto, Error Handling During Initialization
printk, printk (see printk function)
switch, ioctl, The Return Value
with ioctl method, ioctl
return values, The Return Value
static functions (locking), Ambiguous Rules
static numbers, assignment of, Dynamic Allocation of Major Numbers
statistics, Lookaside Caches, The /proc Interface, Initializing Each Device, The Device Methods, Statistical Information, Statistical Information
on caches, Lookaside Caches
on interrupts, The /proc Interface
on network drivers, Statistical Information
on network interfaces, Initializing Each Device, The Device Methods, Statistical Information
status information, Utility Fields
stop method, The seq_file interface, The Device Methods
store method (kobjects), Default Attributes
strace command, Debugging by Watching
strace tool, Testing the Scullpipe Driver
streaming, DMA mappings, Setting up streaming DMA mappings, Single-page streaming mappings
DMA mappings, DMA mappings, Setting up streaming DMA mappings
single-page mappings, Single-page streaming mappings
string operations, String Operations, Quick Reference
struct block_device_operations *fops field (gendisk), The gendisk structure
struct bus_type *bus field, Devices
struct cdev *i_cdev (inode structure field), The inode Structure
struct dentry *f_dentry (struct file field), The file Structure
struct device *parent field, Devices
struct device fields, Devices
struct device_driver *driver field, Devices
struct file, The file Structure
struct file_operations *fops variable (USB), probe and disconnect in Detail
struct file_operations *f_op (struct file field), The file Structure
struct kobject kobj field, Devices
struct module *owner function, Registering a USB Driver
struct module *owner method, File Operations
struct net_device *next field (net_device structure), Global Information
struct pci_device_id structure (PCI), Configuration Registers and Initialization
struct request structure, A Simple request Method
struct request_queue *queue field (gendisk), The gendisk structure
struct scull_qset structure, scull's Memory Usage
struct termios structure (tty drivers), struct termios–struct termios
struct timeval pointer, Knowing the Current Time
struct tty_flip_buffer structure, No read Function?
struct urb structure, struct urb
struct usb_device *dev field (USB), struct urb
struct usb_device_id structure (USB), What Devices Does the Driver Support?
struct usb_driver structure, Registering a USB Driver
struct usb_host_interface *altsetting field (USB), Interfaces
struct usb_host_interface *cur_altsetting field (USB), Interfaces
struct usb_interface structure, probe and disconnect in Detail
struct usb_iso_packet_descriptor iso_frame_desc field (USB), struct urb
structures, Some Important Data Structures, File Operations–File Operations, Char Device Registration–The Older Way, Char Device Registration, scull's Memory Usage, struct urb, Registering a USB Driver, probe and disconnect in Detail, Kobjects, Ksets, and Subsystems–Subsystems, Binary Attributes, Hotplug Operations, Buses, Device structure embedding, Driver structure embedding, Driver structure embedding, The vm_area_struct structure, The vm_area_struct structure, The mmap Device Operation, Direct Memory Access, The gendisk structure, A Simple request Method, The bio structure, Working with bios, Device Registration, Initializing Each Device, The net_device Structure in Detail–Hardware Information, The Important Fields, Custom ioctl Commands, Statistical Information, Kernel Support for Multicasting, struct termios–struct termios, No read Function?, ioctls, The tty_driver Structure in Detail, The tty_operations Structure in Detail, The tty_struct Structure in Detail
bin_attribute, Binary Attributes
bio, The bio structure, Working with bios
bus_type, Buses
cdev configuration, Char Device Registration
data, Some Important Data Structures, File Operations–File Operations
devices, Device structure embedding
dev_mc_list, Kernel Support for Multicasting
drivers, Driver structure embedding
file_operations (mmap method and), The mmap Device Operation
gendisk, The gendisk structure
ifreq, Custom ioctl Commands
kobjects, Kobjects, Ksets, and Subsystems–Subsystems
kset_hotplug_ops, Hotplug Operations
ldd_driver, Driver structure embedding
net_device, Device Registration, The net_device Structure in Detail–Hardware Information
net_device_stats, Initializing Each Device, Statistical Information
registration, Char Device Registration–The Older Way
scatterlist, Direct Memory Access
serial_icounter_struct, ioctls
sk_buff, The Important Fields
struct request, A Simple request Method
struct scull_qset, scull's Memory Usage
struct termios (tty drivers), struct termios–struct termios
struct tty_flip_buffer, No read Function?
struct urb, struct urb
struct usb_driver, Registering a USB Driver
struct usb_interface, probe and disconnect in Detail
tty_driver, The tty_driver Structure in Detail
tty_operations, The tty_operations Structure in Detail
tty_struct, The tty_struct Structure in Detail
vm_area_struct, The vm_area_struct structure
vm_operations_struct, The vm_area_struct structure
submission of urbs, Submitting Urbs, Submitting and Controlling a Urb
SUBSYSTEM variable, The /sbin/hotplug Utility
subsystems, Splitting the Kernel, Classes of Devices and Modules, The Kernel Symbol Table, Configuration Registers and Initialization, Configuration Registers and Initialization, USB Drivers, Kobject Hierarchies, Ksets, and Subsystems, Subsystems, Class interfaces, How It Works
classes, Class interfaces
deviceID register (PCI), Configuration Registers and Initialization
firmware, How It Works
ksets, Subsystems
memory management, Splitting the Kernel
module stacking, The Kernel Symbol Table
USB, Classes of Devices and Modules, USB Drivers (see USB)
vendorID register (PCI), Configuration Registers and Initialization
Super-H architecture, Platform Dependencies
supervisor mode, User Space and Kernel Space, User Space and Kernel Space
support, Debugging Support in the Kernel–Debugging Support in the Kernel, Kernel Support for Multicasting, Media Independent Interface Support, Ethtool Support
Ethtool, Ethtool Support
kernels (debugging), Debugging Support in the Kernel–Debugging Support in the Kernel
MII, Media Independent Interface Support
multicasting, Kernel Support for Multicasting
swappers, Yielding the processor
switch statements, ioctl, The Return Value
return values, The Return Value
with ioctl method, ioctl
symbolic links (kobjects), Symbolic Links
symbols, The Kernel Symbol Table–The Kernel Symbol Table, Turning the Messages On and Off, Do-it-yourself probing, Quick Reference, The mmap Device Operation, Setting up streaming DMA mappings, Setting up streaming DMA mappings, Setting up streaming DMA mappings, Setting up streaming DMA mappings, Setting up streaming DMA mappings, Setting up streaming DMA mappings, Direct Memory Access, Queue control functions, Packet Reception, Kernel Support for Multicasting
BLK_BOUNCE_HIGH, Queue control functions
bytes, Quick Reference
CHECKSUM, Packet Reception
DMA_BIDIRECTIONAL, Setting up streaming DMA mappings
DMA_FROM_DEVICE, Setting up streaming DMA mappings
DMA_NONE, Setting up streaming DMA mappings
DMA_TO_DEVICE, Setting up streaming DMA mappings, Direct Memory Access
IFF_, Kernel Support for Multicasting
NR_IRQS, Do-it-yourself probing
PAGE_SIZE, The mmap Device Operation
PCI_DMA_FROMDEVICE, Setting up streaming DMA mappings
PCI_DMA_TODEVICE, Setting up streaming DMA mappings
PDEBUG/PDEBUGG, Turning the Messages On and Off
symmetric multiprocessor (SMP) systems, Concurrency in the Kernel
synchronization, Completions, PCI double-address cycle mappings
DMA buffers, PCI double-address cycle mappings
semaphores, Completions
sysfs directory, USB and Sysfs–USB and Sysfs, struct termios
trees (USB), USB and Sysfs–USB and Sysfs
tty driver, struct termios
sysfs filesystem, Low-Level Sysfs Operations–Symbolic Links, Sysfs Operations
low-level operations, Low-Level Sysfs Operations–Symbolic Links
syslogd daemon, How Messages Get Logged
sysrq operations, System Hangs
sysrq.txt file, System Hangs
system calls, Loading and Unloading Modules
system faults, Kernel Modules Versus Applications, Debugging System Faults
debugging, Debugging System Faults
handling, Kernel Modules Versus Applications
system hangs, System Hangs–System Hangs
system shutdown, The Linux Device Model
sys_syslog function, printk

T

_t data types, Interface-Specific Types
table pages, Using I/O Memory, Page Tables, Mapping Memory with nopage
I/O memory and, Using I/O Memory
nopage VMA method, Mapping Memory with nopage
tables, symbols, The Kernel Symbol Table–The Kernel Symbol Table
tagged command queuing (TCQ), Tagged Command Queueing–Tagged Command Queueing
tagged initialization formats, File Operations
tasklets, Tasklets–Tasklets, Tasklets, Tasklets
interrupt handlers, Tasklets
tasklet_schedule function, Tasklets
tcpdump program, The Physical Transport of Packets
TCQ (tagged command queueing), Tagged Command Queueing–Tagged Command Queueing
tearing down single-page streaming mappings, Single-page streaming mappings
templates, scull (design of), The Design of scull
terminals, selecting for messages, Redirecting Console Messages
termios userspace functions, set_termios
test system setup, Setting Up Your Test System
testing, The Hello World Module, Playing with the New Devices, Testing the Scullpipe Driver, Initialization in sbull
block drivers, Initialization in sbull
char drivers, Playing with the New Devices
hello world modules, The Hello World Module
scullpipe drivers, Testing the Scullpipe Driver
test_and_change_bit operation, Bit Operations
test_and_clear_bit operation, Bit Operations
test_and_set_bit operation, Bit Operations
test_bit operation, Bit Operations
thread execution, Concurrency and Its Management
throughput (DMA), Direct Memory Access–Talking to the DMA controller
时间、测量时间间隔-特定于处理器的寄存器测量时间间隔了解当前时间-了解当前时间延迟执行-短延迟内核定时器的实现 Tasklet - Tasklet工作队列-共享队列计时延迟,时间间隔,时间间隔,时间间隔,启动时间
time, Measuring Time LapsesProcessor-Specific Registers, Measuring Time Lapses, Knowing the Current TimeKnowing the Current Time, Delaying ExecutionShort Delays, The Implementation of Kernel Timers, TaskletsTasklets, WorkqueuesThe Shared Queue, Timekeeping, Delays, Time Intervals, Time Intervals, Time Intervals, Boot Time
boot (PCI), Boot Time
current time (retrieving), Knowing the Current Time–Knowing the Current Time
execution of code (delaying), Delaying Execution–Short Delays, Delays
HZ (time frequency), Measuring Time Lapses, Time Intervals
intervals of (data type portability), Time Intervals
kernel timers, The Implementation of Kernel Timers
lapses (measurement of), Measuring Time Lapses–Processor-Specific Registers
tasklets, Tasklets–Tasklets
time intervals in the kernel, Time Intervals
workqueues, Workqueues–The Shared Queue
timeouts, Timeouts, Timeouts, Initializing Each Device
configuration, Timeouts
scheduling, Timeouts
transmission, Initializing Each Device (see transmission timeouts)
timer.h header file, The Timer API
timers, Measuring Time Lapses, Kernel Timers–The Implementation of Kernel Timers, The Implementation of Kernel Timers, Kernel Timers
interrupts, Measuring Time Lapses
kernels, Kernel Timers–The Implementation of Kernel Timers, Kernel Timers
timer_list structure, The Timer API
timestamp counter (TSC), Processor-Specific Registers
tiny_close function, open and close
tiny_tty_driver variable, struct termios
TIOCLINUX command, Redirecting Console Messages
tiocmget function, tiocmget and tiocmset
tiocmset function, tiocmget and tiocmset
token ring networks, setting up interfaces for, Interface Information
tools, Debugging Support in the Kernel–Debugging Support in the Kernel, Debuggers and Related Tools–Dynamic Probes, Debuggers and Related Tools, Fine- Versus Coarse-Grained Locking, Testing the Scullpipe Driver, Kernel Timers–The Implementation of Kernel Timers, The /sbin/hotplug Utility, Ethtool Support
(see also debugging; utilities)
debuggers, Debuggers and Related Tools–Dynamic Probes
Ethtool, Ethtool Support
kernels (enabling configuration options), Debugging Support in the Kernel–Debugging Support in the Kernel
lockmeter, Fine- Versus Coarse-Grained Locking
/sbin/hotplug utility, The /sbin/hotplug Utility
strace, Testing the Scullpipe Driver
timers, Kernel Timers–The Implementation of Kernel Timers
top halves (interrupt handlers), Top and Bottom Halves–Workqueues
tracing programs, The Linux Trace Toolkit
tracking, Error Handling During Initialization, scull's Memory Usage
registration, Error Handling During Initialization
struct scull_qset (structure), scull's Memory Usage
transfers, USB Transfers Without Urbs–Other USB Data Functions, Direct Memory Access–Talking to the DMA controller, Setting up streaming DMA mappings, Direct Memory Access
buffers, Setting up streaming DMA mappings
DMA, Direct Memory Access–Talking to the DMA controller, Direct Memory Access
USB without urbs, USB Transfers Without Urbs–Other USB Data Functions
transistor-transistor logic (TTL) levels, An Overview of the Parallel Port
transmission concurrency, controlling, Controlling Transmission Concurrency
transmission of packets, The Physical Transport of Packets, Packet Transmission–Transmission Timeouts
transmission timeouts, Initializing Each Device, The Device Methods, Utility Fields, Transmission Timeouts
tx_timeout method and, The Device Methods
watchdog_timeo field and, Utility Fields
traps (locking), Locking Traps–Fine- Versus Coarse-Grained Locking
traversal of linked lists, Linked Lists
trees, USB and Sysfs–USB and Sysfs, udev, TTY Drivers
/dev, udev
sysfs (USB and), USB and Sysfs–USB and Sysfs
tty drivers, TTY Drivers
troubleshooting, Debugging Techniques, System Hangs, Pitfalls in scull, Locking Traps–Fine- Versus Coarse-Grained Locking, Device Control Without ioctl, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory, Platform Dependencies, Pointers and Error Values, Using remap_pfn_range, Do-it-yourself allocation, Dealing with difficult hardware, DMA mappings
caches, I/O Registers and Conventional Memory, I/O Registers and Conventional Memory, Using remap_pfn_range, DMA mappings
DMA hardware, Dealing with difficult hardware
fragmentation, Do-it-yourself allocation
locking, Locking Traps–Fine- Versus Coarse-Grained Locking
memory (scull), Pitfalls in scull
porting problems, Platform Dependencies
system hangs, System Hangs
values, Pointers and Error Values
wrong font on console, Device Control Without ioctl
truncating devices on open, The open Method
tr_configure function, Interface Information
TSC, Processor-Specific Registers
TTL (transistor-transistor logic) levels, An Overview of the Parallel Port, An Overview of the Parallel Port
tty drivers, TTY Drivers–A Small TTY Driver, struct termios–struct termios, struct termios, tty_driver Function Pointers–No read Function?, Other Buffering Functions, TTY Line Settings, proc and sysfs Handling of TTY Devices, The tty_driver Structure in Detail, The tty_operations Structure in Detail, The tty_struct Structure in Detail, Quick Reference
buffers, Other Buffering Functions
directories, proc and sysfs Handling of TTY Devices
functions, Quick Reference
line settings, TTY Line Settings
pointers, tty_driver Function Pointers–No read Function?
struct termios, struct termios–struct termios
sysfs directories, struct termios
tty_driver structure, The tty_driver Structure in Detail
tty_operations structure, The tty_operations Structure in Detail
tty_struct structure, The tty_struct Structure in Detail
tty_driver structure, The tty_driver Structure in Detail, The tty_operations Structure in Detail, The tty_struct Structure in Detail
TTY_DRIVER_NO_DEVFS flag, struct termios
TTY_DRIVER_REAL_RAW flag, struct termios
TTY_DRIVER_RESET_TERMIOS flag, struct termios
tty_get_baud_rate function, set_termios
tty_register_driver function, A Small TTY Driver
tunelp program, The Role of the Device Driver, The Role of the Device Driver
turning messages on/off, Turning the Messages On and Off
tx_timeout method, The Device Methods, Transmission Timeouts
TYPE variable, USB
types, Module Parameters, Quick Reference, Bus attributes, Address Types
addresses, Address Types
bus_attribute, Bus attributes
module parameter support, Module Parameters
PCI driver support, Quick Reference

U

u16 bcdDevice_hi field (USB), What Devices Does the Driver Support?
u16 bcdDevice_lo field (USB), What Devices Does the Driver Support?
u16 idProduct field (USB), What Devices Does the Driver Support?
u16 idVendor field (USB), What Devices Does the Driver Support?
u16 match_flags field (USB), What Devices Does the Driver Support?
u8 bDeviceClass field (USB), What Devices Does the Driver Support?
u8 bDeviceProtocol field (USB), What Devices Does the Driver Support?
u8 bDeviceSubClass field (USB), What Devices Does the Driver Support?
u8 bInterfaceClass field (USB), What Devices Does the Driver Support?
u8 bInterfaceProtocol field (USB), What Devices Does the Driver Support?
u8 bInterfaceSubClass field (USB), What Devices Does the Driver Support?
u8, u16, u32, u64 data types, Assigning an Explicit Size to Data Items
uaccess.h header file, read and write, Quick Reference, Using the ioctl Argument, Quick Reference
udelay, Short Delays
uint8_t/uint32_t types, Assigning an Explicit Size to Data Items
uintptr_t type (C99 standard), Use of Standard C Types
unaligned data, Quick Reference
access, Quick Reference
unaligned.h header file, Data Alignment
unidirectional pipes (USB endpoints), Endpoints
uniprocessor systems, concurrency in, Concurrency in the Kernel
universal serial bus, Classes of Devices and Modules (see USB)
Unix, Splitting the Kernel, Classes of Devices and Modules
filesystems, Splitting the Kernel
interfaces (access to), Classes of Devices and Modules
unlinking urbs, Canceling Urbs
unloading, Kernel Modules Versus Applications, Loading and Unloading Modules, Registering a USB Driver, Module Unloading
modules, Kernel Modules Versus Applications, Loading and Unloading Modules, Module Unloading
USB drivers, Registering a USB Driver
unlocking semaphores, Semaphores and Mutexes
unmapping DMA buffers, Setting up streaming DMA mappings, Setting up streaming DMA mappings
(see also mapping)
unregistering facilities, Error Handling During Initialization
unregister_netdev function, Module Unloading
unshielded twisted pair (UTP), Interface Information
unsigned char *setup_packet field (USB), struct urb
unsigned int bi_size field (bio structure), The bio structure
unsigned int f_flags (struct file field), The file Structure
unsigned int irq function, Installing an Interrupt Handler
unsigned int pipe field (USB), struct urb
unsigned int transfer_flags field (USB), struct urb
unsigned long bi_flags field (bio structure), The bio structure
unsigned long flags field (memory), The Memory Map and Struct Page
unsigned long flags function, Installing an Interrupt Handler
unsigned long method, File Operations
unsigned long nr_sectors field (request structure), A Simple request Method
unsigned long pci_resource_end function, Accessing the I/O and Memory Spaces
unsigned long pci_resource_flags function, Accessing the I/O and Memory Spaces
unsigned long pci_resource_start function, Accessing the I/O and Memory Spaces
unsigned long state field (net_device structure), Global Information
unsigned num_altsetting field (USB), Interfaces
unsigned short bio_hw_segments field (bio structure), The bio structure
unsigned short bio_phys_segments field (bio structure), The bio structure
unsigned type, Manipulating I/O ports
up function, The Linux Semaphore Implementation
updates, RCU, Read-Copy-Update
urandom device, Installing an Interrupt Handler
urbs, USB Urbs–Canceling Urbs, struct urb, Creating and Destroying Urbs, Interrupt urbs, Submitting Urbs, Canceling Urbs, Canceling Urbs, Canceling Urbs, Submitting and Controlling a Urb, USB Transfers Without Urbs–Other USB Data Functions
cancellation of, Canceling Urbs
interrupts, Interrupt urbs
killing, Canceling Urbs
submitting, Submitting Urbs
unlinking, Canceling Urbs
USB, USB Urbs–Canceling Urbs, struct urb, Creating and Destroying Urbs, Submitting and Controlling a Urb, USB Transfers Without Urbs–Other USB Data Functions
creating/destroying, Creating and Destroying Urbs
struct urb structure, struct urb
submitting, Submitting and Controlling a Urb
transfers without, USB Transfers Without Urbs–Other USB Data Functions
urbs_completion function, Completing Urbs: The Completion Callback Handler
usage count, The open Method, The release Method, Adding VMA Operations, Remapping RAM with the nopage method
decremented by release method, The release Method
incremented by open method, The open Method
nopage method and, Remapping RAM with the nopage method
USB (universal serial bus), Classes of Devices and Modules, The Kernel Symbol Table, USB Drivers–Interfaces, Configurations, USB and Sysfs–USB and Sysfs, USB Urbs–Canceling Urbs, Writing a USB Driver–Submitting and Controlling a Urb, USB Transfers Without Urbs–Other USB Data Functions, USB
configurations, Configurations
hotplugging, USB
stacking, The Kernel Symbol Table
sysfs directory tree, USB and Sysfs–USB and Sysfs
transfers without urbs, USB Transfers Without Urbs–Other USB Data Functions
urbs, USB Urbs–Canceling Urbs
writing, Writing a USB Driver–Submitting and Controlling a Urb
USB request blocks, USB Urbs (see urbs)
usbcore module, The Kernel Symbol Table
usb_alloc_urb function, Creating and Destroying Urbs
usb_bulk_msg function, usb_bulk_msg
usb_control_msg function, usb_control_msg
USB_DEVICE macro, What Devices Does the Driver Support?
USB_DEVICE_INFO macros, What Devices Does the Driver Support?
USB_DEVICE_VER macro, What Devices Does the Driver Support?
usb_fill_bulk_urb function, Bulk urbs
usb_fill_control_urb function, Control urbs
usb_fill_int_urb function, Interrupt urbs
usb_get_descriptor function, Other USB Data Functions
USB_INTERFACE_INFO macro, What Devices Does the Driver Support?
usb_kill_urb function, Canceling Urbs
usb_register_dev function, probe and disconnect in Detail
usb_set_intfdata function, probe and disconnect in Detail
usb_string function, Other USB Data Functions
usb_submit_urb function, Submitting Urbs
usb_unlink_urb function, Canceling Urbs
user mode, User Space and Kernel Space, User Space and Kernel Space
user programs, The Role of the Device Driver
user space, User Space and Kernel Space, User Space and Kernel Space, Doing It in User Space–Doing It in User Space, Doing It in User Space, read and write, Using the ioctl Argument, Capabilities and Restricted Operations, I/O Port Access from User Space, Assigning an Explicit Size to Data Items, The Linux Device Model, Performing Direct I/O–An asynchronous I/O example, TTY Line Settings
capabilities/restrictions in, Capabilities and Restricted Operations
communication with, The Linux Device Model
direct I/O, Performing Direct I/O–An asynchronous I/O example
explicitly sizing data in, Assigning an Explicit Size to Data Items
I/O port access from, I/O Port Access from User Space
programming, User Space and Kernel Space, Doing It in User Space–Doing It in User Space
retrieving datum from, Using the ioctl Argument
transferring to/from kernel space, read and write
tty drivers, TTY Line Settings
writing drivers in, Doing It in User Space
user virtual addresses, Address Types
User-Mode Linux, The User-Mode Linux Port
utilities, The Role of the Device Driver, The Hello World Module, The Hello World Module, Loading and Unloading Modules, The Kernel Symbol Table, The Kernel Symbol Table
(see also programs)
insmod, The Hello World Module
modprobe, Loading and Unloading Modules, The Kernel Symbol Table
rmmod, The Hello World Module
utility fields (net_device structure), Utility Fields
UTP (unshielded twisted pair), Interface Information
UTS_RELEASE macro, Version Dependency, Version Dependency

V

values, The Return Value, Measuring Time Lapses, Short Delays, Short Delays, Handler Arguments and Return Value, Pointers and Error Values, Utility Fields
BogoMips, Short Delays
errors, Pointers and Error Values
jiffies, Measuring Time Lapses, Utility Fields
loops_per_jiffy, Short Delays
return, The Return Value, Handler Arguments and Return Value
interrupt handlers, Handler Arguments and Return Value
switch statements, The Return Value
variables, printk, Atomic Variables, Per-CPU Variables–Per-CPU Variables, probe and disconnect in Detail, probe and disconnect in Detail, probe and disconnect in Detail, probe and disconnect in Detail, Add a Device, The /sbin/hotplug Utility, The /sbin/hotplug Utility, The /sbin/hotplug Utility, The /sbin/hotplug Utility, PCI, PCI, PCI, PCI, Input, Input, Input, USB, USB, USB, USB, struct termios
ACTION, The /sbin/hotplug Utility
atomic, Atomic Variables
char*name (USB), probe and disconnect in Detail
console_loglevel, printk
DEVICE, USB
DEVPATH, The /sbin/hotplug Utility
int minor_base (USB), probe and disconnect in Detail
INTERFACE, USB
mode_t mode (USB), probe and disconnect in Detail
NAME, Input
pci_bus_type, Add a Device
PCI_CLASS, PCI
PCI_ID, PCI
PCI_SLOT_NAME, PCI
PCI_SUBSYS_ID, PCI
per-CPU, Per-CPU Variables–Per-CPU Variables
PHYS, Input
PRODUCT, Input, USB
SEQNUM, The /sbin/hotplug Utility
struct file_operations *fops (USB), probe and disconnect in Detail
SUBSYSTEM, The /sbin/hotplug Utility
tiny_tty_driver, struct termios
TYPE, USB
vector operations, char drivers, readv and writev
vendorID register (PCI), Configuration Registers and Initialization
VERIFY_ symbols, Using the ioctl Argument, Quick Reference
version dependency, Version Dependency
version.h header file, Version Dependency, Quick Reference
versions, Version Numbering–Version Numbering, Version Dependency, Major and Minor Numbers, Major and Minor Numbers, The Internal Representation of Device Numbers, The Older Way
dependency, Version Dependency
numbering, Version Numbering–Version Numbering, Major and Minor Numbers, Major and Minor Numbers, The Internal Representation of Device Numbers, The Older Way
char drivers, Major and Minor Numbers
major device numbers, The Internal Representation of Device Numbers
minor device numbers, Major and Minor Numbers
older char device registration, The Older Way
VESA Local Bus (VLB), VLB
vfree function, vmalloc and Friends
video memory (mapping), The mmap Device Operation
viewing kernels, Splitting the Kernel
virtual addresses, Address Types, Remapping Kernel Virtual Addresses, Remapping Kernel Virtual Addresses, Bus Addresses
(see also addresses)
conversion, Bus Addresses
remapping, Remapping Kernel Virtual Addresses
virtual memory, Address Types, Address Types
(see also memory)
virtual memory area, Virtual Memory Areas (see VMA)
virt_to_page function, The Memory Map and Struct Page
VLB (VESA Local Bus), VLB
VMA (virtual memory area), Virtual Memory Areas–The vm_area_struct structure, Adding VMA Operations
vmalloc allocation function, vmalloc and Friends–A scull Using Virtual Addresses: scullv
vmalloc.h header file, vmalloc and Friends
vm_area_struct structure, The vm_area_struct structure
VM_IO flag, The vm_area_struct structure
vm_operations_struct structure, The vm_area_struct structure
VM_RESERVED flag, The vm_area_struct structure
void *context field (USB), struct urb
void *dev_id function, Installing an Interrupt Handler
void *driver_data field, Devices
void *private_data (struct file field), The file Structure
void *private_data field (gendisk), The gendisk structure
void *release field, Devices
void *transfer_buffer field (USB), struct urb
void *virtual field (memory), The Memory Map and Struct Page
void barrier function, I/O Registers and Conventional Memory
void blk_queue_bounce_limit function, Queue control functions
void blk_queue_dma_alignment function, Queue control functions
void blk_queue_hardsect_size function, Queue control functions
void blk_queue_max_hw_segments function, Queue control functions
void blk_queue_max_phys_segments function, Queue control functions
void blk_queue_max_sectors function, Queue control functions
void blk_queue_max_segment_size function, Queue control functions
void blk_start_queue function, Queue control functions
void blk_stop_queue function, Queue control functions
void field (PCI registration), Registering a PCI Driver
void function, Registering a USB Driver
void mb function, I/O Registers and Conventional Memory
void read_barrier_depends function, I/O Registers and Conventional Memory
void rmb function, I/O Registers and Conventional Memory
void smp_mb functions, I/O Registers and Conventional Memory
void smp_read_barrier_depends function, I/O Registers and Conventional Memory
void smp_rmb function, I/O Registers and Conventional Memory
void smp_wmb function, I/O Registers and Conventional Memory
void tasklet_disable function, Tasklets
void tasklet_disable_nosync function, Tasklets
void tasklet_enable function, Tasklets
void tasklet_hi_schedule function, Tasklets
void tasklet_kill function, Tasklets
void tasklet_schedule function, Tasklets
void wmb function, I/O Registers and Conventional Memory

W

wait queues, Introduction to Sleeping, Manual sleeps, The Underlying Data Structure, Quick Reference, Timeouts
delaying code execution, Timeouts
poll table entries and, The Underlying Data Structure
putting processes into, Quick Reference
wait_event macro, Simple Sleeping
wait_event_interruptible_timeout function, Timeouts
wake_up function, Simple Sleeping, Exclusive waits, Quick Reference
wake_up_interruptible function, Quick Reference
wake_up_interruptible_sync function, Quick Reference
wake_up_sync function, Quick Reference
Wall flag, Interface-Specific Types
watchdog_timeo field (net_device structure), Utility Fields, Transmission Timeouts
wc command, Debugging by Watching
wMaxPacketSize field (USB), Endpoints
workqueues, Workqueues–The Shared Queue, Workqueues
interrupt handlers, Workqueues
WQ_FLAG_EXCLUSIVE flag set, Exclusive waits
write function (tty drivers), Flow of Data
write method, File Operations, The file Structure, The write Method, Debugging by Watching, Oops Messages, Writing to the device, Talking to the DMA controller
code for, The write Method
configuring DMA controller, Talking to the DMA controller
f_pos field (file structure) and, The file Structure
oops messages, Oops Messages
poll method and, Writing to the device
return values, rules for interpreting, The write Method
select method and, Writing to the device
strace command and, Debugging by Watching
write system call, File Operations
write-buffering example, A Write-Buffering Example
writev calls, readv and writev
writev method, File Operations
writing, Version Numbering, Doing It in User Space, The write Method, Debugging Techniques, Device Control Without ioctl, Blocking and Nonblocking Operations, Writing a USB Driver–Submitting and Controlling a Urb
blocking/nonblocking operations, Blocking and Nonblocking Operations
control sequences to devices, Device Control Without ioctl
to a device, The write Method
drivers, Version Numbering, Doing It in User Space
in user space, Doing It in User Space
version numbering, Version Numbering
USB drivers, Writing a USB Driver–Submitting and Controlling a Urb

Z

zero-order limitations, Remapping RAM with the nopage method
zones (memory), Memory zones
zSeries architecture, S/390 and zSeries

About the Authors

Jonathan Corbet got his first look at the BSD Unix source back in 1981, when an instructor at the University of Colorado let him "fix" the paging algorithm. He has been digging around inside every system he could get his hands on ever since, working on drivers for VAX, Sun, Ardent, and x86 systems on the way. He got his first Linux system in 1993, and has never looked back. Mr. Corbet is currently the co-founder and executive editor of Linux Weekly News (http://LWN.net/); he lives in Boulder, Colorado with his wife and two children.

Alessandro installed Linux 0.99.14 soon after getting his degree as an electronic engineer. He then received a Ph.D. in computer science at the University of Pavia despite his aversion toward modern technology. He left the university after getting his Ph.D. because he didn't want to write articles. He now works as a freelancer writing device drivers and, um...articles. He used to be a young hacker before his babies were born; he's now an old advocate of Free Software who developed a bias for non-PC computer platforms.

Greg Kroah-Hartman has been writing Linux kernel drivers since 1999, and is currently the maintainer for the USB, PCI, I2C, driver core, and sysfs kernel subsystems. He is also the maintainer of the udev and hotplug userspace programs, as well as being a Gentoo kernel maintainer, ensuring that his email inbox is never empty. He is a contributing editor to Linux Journal Magazine, and works for IBM's Linux Technology Center, doing various Linux kernel related tasks.

Linux Device Drivers, 3rd Edition

Jonathan Corbet

Alessandro Rubini

Greg Kroah-Hartman

Editor

Andy Oram

O'Reilly Media

1005 Gravenstein Highway North

Sebastopol, CA 95472

2012-08-19T20:15:00-07:00